Abstract

We tackle the problem of predicting the performance of MapReduce
applications by designing accurate progress indicators, which
keep programmers informed on the percentage of completed
computation time during the execution of a job. Through extensive
experiments, we show that state-of-the-art progress indicators
(including the one provided by Hadoop) can be seriously harmed
by data skewness, load unbalancing, and straggling tasks. This is
mainly due to their implicit assumption that the running time
depends linearly on the input size. We thus design a novel
profile-guided progress indicator, called NearestFit, that operates
without the linear hypothesis assumption and exploits a careful
combination of nearest neighbor regression and statistical curve
fitting techniques. Our theoretical progress model requires
fine-grained profile data, which can be very difficult to manage in
practice. To overcome this issue, we resort to computing accurate
approximations for some of the quantities used in our model through
space- and time-efficient data streaming algorithms. We implemented
NearestFit on top of Hadoop 2.6.0. An extensive empirical
assessment over the Amazon EC2 platform on a variety of
real-world benchmarks shows that NearestFit is practical w.r.t.
space and time overheads and that its accuracy is generally very
good, even in scenarios where competitors incur non-negligible
errors and wide prediction fluctuations. Overall, NearestFit
significantly improves the current state of the art on progress
analysis for MapReduce.

Categories and Subject Descriptors D.2.8 [Software engineering]:
Metrics - performance measures; D.2.5 [Software engineering]:
Testing and debugging - distributed debugging; C.4 [Performance
of systems]: Measurement techniques

Keywords MapReduce, Hadoop, performance profiling, performance
prediction, progress indicators, data skewness, nearest-neighbor
regression, curve fitting.

1. Introduction 

The ability to perform scalable and timely processing of massive 
datasets is a crucial endeavor of our era. To handle increasing
volumes of data, the last fifteen years have seen the emergence of
powerful computational infrastructures that, in turn, have spurred
new programming language technologies, resulting in a generalized
shift from sequential to parallel programming models.

The quest for processing extreme data on complex platforms
in a programmer-accessible way has been key to the success of
MapReduce [8] and of the entire Apache ecosystem centered
around Hadoop [2]. MapReduce allows developers to expose
parallelism in their applications by means of powerful high-level
computing primitives (map and reduce functions), hiding the details of
how a computation is actually mapped to the underlying distributed
platform. Its runtime system automatically parallelizes the
computation, handling complex low-level details of the execution (data
partitioning, task distribution, load balancing, node communication,
fault tolerance) and making it possible to scale applications
to large clusters of inexpensive commodity nodes. Since the
introduction of MapReduce in 2004, there has been a proliferation
of programming models and software frameworks for large-scale
data analysis in response to diverse application requirements (e.g.,
iterative processing [?], streaming [26], incremental computations
[3], SQL-like languages [27, 31], and graph processing [22]).

Big data systems have significantly improved software development
at scale: the ease of programming and the capability to express
ample sets of algorithms have been primary concerns in their
design. However, these frameworks typically turn out to be a “black
box” to programmers, who are still faced with many diverse and
difficult issues when debugging and optimizing applications. For
instance, when a user runs a MapReduce job that seems to take an
abnormally long time, there is no easy way of pinpointing the
reason for that behavior. Our own experience is confirmed by
anecdotal evidence in many developers’ forums, where programmers often
report unexpected performance behaviors and ask for insights.

Progress analysis. An important problem targeted by a variety
of works in the last few years is to predict the performance of
MapReduce applications by designing accurate progress indicators
(see, e.g., [?]). A progress indicator keeps programmers
informed on the percentage of completed computation time
during the execution of a job. This is especially important for
long-running applications and can shed light on abnormal behaviors,
helping programmers distinguish between slow and stalled
computations and pinpoint algorithmic inefficiencies, programming
errors, or load balancing issues. In certain settings, especially in
pay-as-you-go cloud environments, the user might want to identify slow
jobs and possibly abort them in order to avoid excessive costs.
Progress analysis can also guide useful profile-driven
optimizations, such as skew mitigation techniques [17, 18]. It is
interesting to note that Hadoop [2] comes with its own progress
indicator.

A typical hypothesis in the design of progress indicators is that
the running time depends linearly on the input size. For instance,
the completion time of a task may be computed as the product
between the size of the unprocessed data for that task and the average
processing time per unit of data observed so far. This linear progress
assumption can be a serious limitation in some applications,
especially when the computational complexity of map/reduce functions
grows more than linearly with respect to the input size. Such
computations are not infrequent. Computing the clustering coefficient,
which is very useful in social network analysis, is one such example:
state-of-the-art algorithms exploit reduce functions with quadratic
worst-case running time [?]. Many recommendation systems for Web
sites such as Yahoo! or LinkedIn are also increasingly driven
by computation-intensive analytics over massively distributed data.
Large running times, combined with data skewness (e.g., power-law
degree distributions in social networks), are responsible for the
so-called curse of the last reducer phenomenon, where 99% of
the computation terminates quickly, but the remaining 1% takes a
disproportionately longer amount of time. Unfortunately, the
wall-clock times of stragglers (slow running tasks) are far from being
well-predicted under the linear progress assumption.
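
As an illustration of why the linear assumption breaks down, the following minimal Python sketch compares a linear estimate with the true remaining time of a task whose reduce function is quadratic. The function names, the cost constant, and the workload are invented for this example and are not taken from the paper or from Hadoop.

```python
def linear_remaining_time(processed_bytes, elapsed, unprocessed_bytes):
    """Linear progress assumption: remaining time is the unprocessed size
    multiplied by the average time per byte observed so far."""
    time_per_byte = elapsed / processed_bytes
    return unprocessed_bytes * time_per_byte


def quadratic_cost(size):
    """Hypothetical reduce function whose running time is quadratic in its input size."""
    return 1e-6 * size ** 2


# Invented skewed workload: 99 small key groups and one very large group.
groups = [100] * 99 + [10_000]
done, pending = groups[:99], groups[99:]

elapsed = sum(quadratic_cost(s) for s in done)    # time spent on the small groups
predicted = linear_remaining_time(sum(done), elapsed, sum(pending))
actual = sum(quadratic_cost(s) for s in pending)  # true cost of the straggler

print(f"predicted remaining: {predicted:.2f}s, actual remaining: {actual:.2f}s")
```

In this toy setting the linear estimate is off by roughly two orders of magnitude, because the speed observed on small groups says little about the cost of the single large one.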

Another widespread practice is to compute progress by exploiting
profile data collected from previous executions. This can
sometimes be misleading due to variability in platform and application
parameters and to datasets with quite diverse characteristics: it may
easily be the case that the same algorithm behaves very differently
even on networks of rather similar size, depending on properties
such as degree distribution or small-world phenomena [?].

A motivating example. Figure 1 exemplifies some of the
aforementioned issues. We used as a benchmark the NodeIterator
algorithm for computing clustering coefficients [30] applied to a
web graph and a social network from the SNAP project [20]. The
upper charts in Figure 1 show the behavior of different progress
indicators, reporting the actual progress on the x-axis vs. the
estimated progress on the y-axis. State-of-the-art progress indicators
(called Hadoop, JobRatio, and TaskRatio in Figure 1) incur
non-negligible errors and wide prediction fluctuations. For instance, in
com-Youtube, after 4 minutes of execution (20% of the actual
running time) the default Hadoop progress indicator estimates that
roughly 70% of the computation is completed: the programmer
will thus expect the execution to terminate in about 2 additional
minutes, while the true wall-clock time will be considerably larger
(18 minutes). Prediction errors are mainly due to stragglers, whose
presence is shown in the swimlanes plots at the bottom of Figure 1.
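
The expectation of "about 2 additional minutes" follows from simple arithmetic on the reported progress. A small Python sketch of that reasoning is given here; the helper name is hypothetical and the numbers are the ones quoted in the text.

```python
def remaining_from_progress(elapsed, reported_progress):
    """If `reported_progress` (a fraction in (0, 1]) is taken at face value,
    the implied total time is elapsed / reported_progress."""
    predicted_total = elapsed / reported_progress
    return predicted_total - elapsed


# 4 minutes elapsed, indicator reports roughly 70% completed.
predicted_remaining = remaining_from_progress(4.0, 0.70)
print(f"implied remaining time: {predicted_remaining:.1f} minutes")
```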

Our contribution. Progress analysis issues exemplified by
Figure 1 are the starting point for this paper. Our main contribution is
the design and the implementation of a novel progress indicator that
operates without the linear model assumption and exploits only
dynamically collected profile data. Moving away from the linear
hypothesis requires a complete rethinking of the underlying progress
model, but offers significant benefits in the presence of skewed
data and computations with straggling tasks. Our progress
indicator, called NearestFit, significantly improves the current state
of the art on progress analysis. For instance, in Figure 1 NearestFit
almost matches the optimal progress (straight line). In more detail,
our main contributions can be summarized as follows:

• We formalize a profile-guided theoretical progress model based 
on a careful combination of nearest neighbor regression and 
statistical curve fitting (hence the name NearestFit). While 
nearest neighbor regression is very stable over time and yields 
accurate estimates, resorting to curve fitting is beneficial to 
perform extrapolations, i.e., to predict the running times beyond 
the range of the already observed executions. This turns out 
to be especially useful in the presence of data skewness and 
straggling instances. 
Figure 1. Computing clustering coefficients in networks
web-Berkstan (left) and com-Youtube (right). The upper charts plot
progress estimates of different progress indicators. The lower charts
(swimlanes plots) show load unbalancing among worker nodes.


• In order to apply regression analysis techniques, we profile the 
executions of map/reduce functions, collecting a set of data 
points that relate the running time of each invocation to its 
observed input size. Such fine-grained profiles are very difficult 
to manage in practice, yielding large time and space overheads. 
We nevertheless show that we can make NearestFit practical: 
this is achieved through space-efficient data structures and data 
streaming algorithms, which enable the efficient computation 
of accurate approximations for some of the quantities used in 
our model. 

• We perform an extensive empirical assessment of NearestFit 
on the Amazon EC2 platform using a variety of real-world 
benchmarks, which expose different computational patterns of 
MapReduce applications. We compare NearestFit against 
state-of-the-art progress indicators showing that: 

■ the accuracy of NearestFit is generally very good, even in 
the presence of data skewness and load unbalancing, which 
can seriously harm competitors; 

■ the orderly combination of nearest neighbor regression and 
curve fitting is crucial to obtain good progress estimates: 
none of the techniques alone achieves reasonable results; 

■ space and time overheads can be kept small; 

■ the use of space-efficient streaming data structures
guarantees efficiency, without affecting accuracy.


2. Anatomy of a MapReduce job 

Introduced by Google in 2004, MapReduce has emerged as one
of the most popular systems for batch processing of extreme
datasets [8]. Its functional-style programming model - where
programmers just need to implement two distinct map and reduce
functions - is indeed simple and amenable to a variety of real-world
tasks, while low-level issues are automatically and transparently
handled by the runtime system.

Map and reduce functions are executed inside map tasks and
reduce tasks, respectively. Both of them work on data represented
as $\langle key, value \rangle$ pairs. Input and output pairs are stored
on a distributed file system, such as Google’s file system or HDFS.
Figure 2. Map, shuffle, and reduce phases in a single wave
scenario with two map tasks and two reduce tasks.


Inner workings of MapReduce. The runtime system splits the
job input into fixed-size chunks (e.g., 64MB each), spawning a
map task per chunk. Depending on cluster capacity (i.e., number
of worker nodes), multiple map tasks can be run in parallel. Each
map task scans its input chunk, invoking the map function on each
$\langle k, v \rangle$ pair. A map invocation can emit a list of
intermediate $\langle k', v' \rangle$ pairs. As soon as all chunk pairs
have been processed, the map task starts a local shuffle phase, where
intermediate pairs are sorted and partitioned among reduce tasks using
a key hash partitioner. The same intermediate key $k'$ can be emitted by
different map tasks, but always ends up assigned to the same reduce
task. Hence, the partitioner is crucial to distribute the shuffle data
to different reduce tasks so as to avoid data unbalancing issues.
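
The role of the partitioner can be sketched in a few lines of Python. This is only an illustration of the idea: the use of CRC32 as hash function and the helper names are assumptions made for this example, not Hadoop's actual partitioner.

```python
import zlib
from collections import defaultdict


def partition(key, num_reduce_tasks):
    """Deterministic key hash partitioner. CRC32 is an assumption of this
    sketch: Python's built-in hash() is randomized per process for strings."""
    return zlib.crc32(key.encode("utf-8")) % num_reduce_tasks


def local_shuffle(intermediate_pairs, num_reduce_tasks):
    """Sort the pairs emitted by one map task and split them among reduce tasks."""
    buckets = defaultdict(list)
    for key, value in sorted(intermediate_pairs):
        buckets[partition(key, num_reduce_tasks)].append((key, value))
    return dict(buckets)


# Two map tasks emitting overlapping intermediate keys (invented data).
shuffled_1 = local_shuffle([("b", 1), ("a", 1), ("b", 1)], num_reduce_tasks=2)
shuffled_2 = local_shuffle([("a", 1), ("c", 1)], num_reduce_tasks=2)
print(shuffled_1, shuffled_2)
```

Because the partition depends only on the key, pairs with key "a" produced by either map task are routed to the same reduce task.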

When all the local shuffle phases have terminated, MapReduce 
spawns reduce tasks in parallel (we avoid discussing slow start, 
where reduce tasks can start fetching data earlier than this time). 
The number of reduce tasks is a user-defined parameter: depending 
on cluster capacity (and similarly to map tasks), reduce tasks can 
be executed in a single wave (i.e., one task per worker, all tasks 
in parallel) or in multiple waves (i.e., more tasks sequentialized on 
the same worker). The swimlanes plots in Figure 1 show a multiple
wave execution on the left and a single wave on the right.

Each reduce task starts with a local shuffle phase where
$\langle k', v' \rangle$ pairs produced by different map tasks are
fetched and sorted based on $k'$. Several key groups
$\langle k', V_{k'} \rangle$ are obtained by shuffling, where
set $V_{k'}$ contains all intermediate values associated to key $k'$.
A reduce function is then invoked for each key group
$\langle k', V_{k'} \rangle$, producing output
$\langle key, value \rangle$ pairs eventually stored in the distributed
file system.
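
The reduce-side steps just described (sort by key, build key groups, invoke the reduce function once per group) can be mimicked in memory as follows. This is a toy sketch with hypothetical names; a real reduce task fetches its pairs over the network and merges sorted runs rather than sorting a list.

```python
from itertools import groupby
from operator import itemgetter


def run_reduce_task(fetched_pairs, reduce_fn):
    """Sort fetched (k', v') pairs by key, build key groups, and invoke
    reduce_fn once per key group, collecting the output pairs."""
    output = []
    ordered = sorted(fetched_pairs, key=itemgetter(0))
    for key, group in groupby(ordered, key=itemgetter(0)):
        values = [value for _, value in group]
        output.extend(reduce_fn(key, values))
    return output


def count_reduce(key, values):
    """Example reduce function: sum the values of a key group."""
    return [(key, sum(values))]


pairs = [("b", 1), ("a", 1), ("b", 1), ("a", 1), ("a", 1)]
result = run_reduce_task(pairs, count_reduce)
print(result)
```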

Phases. At a higher level, the execution of a MapReduce job is 
commonly split into three distinct phases, roughly corresponding 
to the executions of map functions, shuffling, and reduce functions. 
As discussed before, shuffling is performed by both map and
reduce tasks. Throughout this paper we assume that the map phase
includes the time elapsed from the beginning of the first map
function to the termination of the last one. Similarly, the reduce phase
includes the time elapsed from the beginning of the first reduce
function to the termination of the last one. The time elapsed between
the termination of the last map function and the beginning of the first
reduce function is part of the shuffle phase. We refer to Figure 2 for
an example. According to our definition, during the shuffle phase
the job performs exclusively shuffling operations (i.e., no map or
reduce function executions).

Sources of skewness. As discussed in previous works [13, 18],
several kinds of skewness can negatively impact the performance
of MapReduce jobs. A first challenge is to split intermediate keys
uniformly among the reduce tasks: if the hash function is not
designed properly, it can introduce partitioning skewness, responsible
for straggling tasks that can significantly impact the job
performance. Even if keys are evenly distributed, we can have shuffle
data skewness where a few key groups $\langle k, V_k \rangle$ are much
larger than the other ones. This happens quite often in real data sets
and is therefore a very critical issue: e.g., nodes in large social
networks exhibit power-law degree distributions, and grouping node
neighbors during the map phase may result in very unbalanced
neighborhood sizes [30]. High running times of the reduce functions make
shuffle data skewness even more critical: if the implementation of the
reduce function uses a super-linear algorithm, a few executions can
easily become a performance bottleneck for the entire job.
Multimodal input distribution can also arise when different data sets
are concatenated to obtain the input of a single job: chunks obtained
from different input data sets may require different processing times
in practice due to their different characteristics.
We designed our model so as to take care of all these sources of
skewness.

3. The NearestFit progress indicator: theoretical model

In this section we describe the overall design of our progress
indicator, called NearestFit. Similarly to previous works and according
to Section 2, we break down the progress of a MapReduce job into
three different phases (map, shuffle, and reduce), providing
separate progress estimates for each of them. In the description we first
focus on the reduce phase: this is typically the most
computationally demanding phase - where data skewness and load unbalancing
can amplify “curse of the last reducer” phenomena [30] - and also
the most complex with respect to progress prediction. The size and
characteristics of its input depend indeed on the selectivity of the
map function, and thus on the output of the previous phases. For
the sake of presentation, we initially consider the simplified
scenario where there is a single wave of reduce tasks, i.e., all reduce
tasks are immediately started after shuffling. In Section 3.6 we
generalize our model by removing the single-wave assumption and by
taking into account the map and shuffle phases.

3.1 Model overview 

A bird’s-eye view on our approach is shown in Figure 3. For
the reader’s convenience, a summary of the notation introduced
throughout this section is provided in Table 1. When needed, we
use a tilde (as in $\tilde{e}(t)$) to distinguish between the estimate
of a quantity computed by our algorithm and its exact value, typically
unavailable at runtime.

The progress of the reduce phase can be computed at any point
in time during its execution. In practice, updates take place at
discrete moments (every 60 seconds in our implementation), rather
than continuously. We denote by $progress(t)$ the progress
estimated at time $t$, i.e., the percentage of elapsed time since the
beginning of the phase. This is necessarily an estimate, as the
wall-clock time of the reduce phase (which is required to compute the
percentage) will be available only upon termination of the last reduce
function, at which point the progress should be 100%.

As shown in Figure 3, we compute $progress(t)$ by estimating
the ending time $\tilde{e}(t)$ of the reduce phase. To this aim, we
gather profile information at the task level, so as to predict the
ending time $\tilde{e}_i(t)$ for each reduce task $i$. In turn, we
compute $\tilde{e}_i(t)$ by profiling the execution of the reduce
functions run by task $i$ and by combining the predicted running times
$\tilde{f}(|V_k|)$.
Symbol                      Meaning

$t$                         current time (progress is estimated at time $t$)
$progress(t)$               $(t - t_{start}) / (\tilde{e}(t) - t_{start})$
$\tilde{e}(t)$              $\max_{\text{reduce tasks } i} \tilde{e}_i(t)$
$\tilde{e}_i(t)$            $p_i(t) + \tilde{r}_i(t)$ = end time estimated for reduce task $i$
$p_i(t)$                    time of the most recent profile update for task $i$: $p_i(t) \le t$
$\tilde{r}_i(t)$            estimate of the remaining time for reduce task $i$
$K_i$                       set of keys assigned to reduce task $i$
$K_i(t) \subseteq K_i$      keys assigned to reduce task $i$ not yet processed at time $p_i(t)$;
                            $K_i(t) = K_i$ if no profile has been collected until time $p_i(t)$
$V_k$                       set of values associated to a key $k$
$f(k, V_k)$                 theoretical cost model for the reduce running time
$r_i(t)$                    $\sum_{k \in K_i(t)} f(k, V_k)$ = exact remaining time for reduce task $i$ at time $p_i(t)$
$|V_k|$                     size (in bytes) of the key group $\langle k, V_k \rangle$
$\tilde{f}(|V_k|)$          estimate of $f(k, V_k)$, as a function of the size of the input key group
$\tilde{r}_i(t)$            $\sum_{k \in K_i(t)} \tilde{f}(|V_k|)$

Table 1. Summary of the notation used in the description of our model.


We now describe each step in more detail. Profile gathering
issues, i.e., when and which profile data are collected by
NearestFit to estimate progress, are discussed later in the paper.

3.2 Estimating the reduce phase progress: $progress(t)$

The progress of the reduce phase is estimated at time t as the ratio 
between the time elapsed from the beginning of the phase until time 
t and the predicted duration of the phase: 

$$progress(t) = \frac{t - t_{start}}{\tilde{e}(t) - t_{start}} \qquad (1)$$

where $t_{start}$ is the starting time of the reduce phase (i.e., the
starting time of the first reduce function) and $\tilde{e}(t)$ is an
estimate of its ending time. Since more and more profile data become
available as the computation proceeds, the end time estimate is a
function of $t$ and will likely return different values when computed
at different times. Under the single-wave assumption, $\tilde{e}(t)$ can
be obtained as the maximum of the predicted end times among all the
reduce tasks, as also shown in Figure 3:

$$\tilde{e}(t) = \max_{\text{reduce tasks } i} \tilde{e}_i(t) \qquad (2)$$

The task end times $\tilde{e}_i(t)$ are also estimates, and can be
either smaller or larger than the actual end times $e_i(t)$. This can
lead to underestimating or overestimating $e(t)$. In the former case,
where $\tilde{e}(t) < e(t)$ as in the example of Figure 4, the estimated
progress will be larger than the actual percentage of elapsed time,
giving developers the temporary illusion that the computation proceeds
faster than it is actually doing. The latter case is symmetric.
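
Equations 1 and 2 translate directly into code. The following Python sketch uses hypothetical function names and invented end-time estimates (in seconds), and also shows the effect of underestimating the end time of the slowest task.

```python
def estimated_phase_end(task_end_estimates):
    """Equation 2 (single wave): the phase ends when the slowest task ends."""
    return max(task_end_estimates)


def progress(t, t_start, task_end_estimates):
    """Equation 1: elapsed time over predicted phase duration.
    Assumes the estimated end time is strictly larger than t_start."""
    e_t = estimated_phase_end(task_end_estimates)
    return (t - t_start) / (e_t - t_start)


t_start = 0.0
# Accurate estimates: the straggler is predicted to end at 1200 s.
accurate = progress(240.0, t_start, [300.0, 1200.0, 450.0])
# Underestimated straggler (600 s instead of 1200 s): progress looks larger.
optimistic = progress(240.0, t_start, [300.0, 600.0, 450.0])
print(accurate, optimistic)
```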

3.3 Predicting the ending time $\tilde{e}_i(t)$ for task $i$

The end time $e_i(t)$ of a reduce task $i$ can be estimated as the sum
of the past and of the remaining execution time of task $i$. We denote
the latter amount with $r_i(t)$. The computation of $r_i(t)$, which is
addressed in Section 3.4, exploits profile data collected from task $i$
as described below.

NearestFit collects a reduce profile for each reduce task. 
Profiles are periodically sent to the application master to update 
the progress estimate. The reduce profile of task $i$ is updated upon
termination of each reduce function invocation in task $i$, i.e., as
soon as a new key group has been fully processed.

Figure 4. Estimated end times for two reduce tasks and for the
entire reduce phase.

Profile updates are unlikely to happen at the same time $t$ at
which we perform a prediction: in general, the prediction at time
$t$ will use profile data collected from task $i$ at some previous
time, which we call $p_i(t)$. More formally, $p_i(t)$ is the time of the
most recent profile update for task $i$ and satisfies the following
properties:

• by definition, $p_i(t) \le t$;

• due to data skewness, a reduce task may be stalled executing
the same reduce instance for a long time, in which case $p_i(t)$
might be much smaller than $t$;

• since different reduce tasks are executed in parallel without
any kind of synchronization, reduce functions are likely to
terminate at different times in different tasks: hence, it can be
$p_i(t) \ne p_j(t)$.

The relation between quantities $t$, $p_i(t)$, $\tilde{e}_i(t)$, and
$\tilde{r}_i(t)$ for two parallel tasks is shown in Figure 4. Under the
single-wave assumption, we can now estimate $\tilde{e}_i(t)$ as:

$$\tilde{e}_i(t) = p_i(t) + \tilde{r}_i(t) \qquad (3)$$

Distinguishing between the prediction time $t$ and the last profile
update time $p_i(t)$ implies that we can avoid speculating about the
status of the reduce instance currently running at time $t$ in task $i$,
which might prove to be difficult. As a consequence, in Equation 3
we scale down from $t$ to $p_i(t)$ the time passed since the beginning
of the task, and absorb the time in-between $p_i(t)$ and $t$ in
$\tilde{r}_i(t)$. Figure 4 gives a visual clue on the involved
quantities.


3.4 Estimating the remaining time $r_i(t)$ for task $i$

Let $K_i$ be the set of keys assigned to a reduce task $i$ (as we will
see later, $K_i$ can be computed by profiling the map phase).
At time $t$, $K_i$ can be conceptually partitioned into three subsets:
fully processed keys (whose corresponding reduce instances have
already terminated), untouched keys (whose corresponding reduce
instances have yet to be started), and a single key that is currently
being processed (whose corresponding reduce instance started at
time $p_i(t)$ and has not yet terminated). We denote with
$K_i(t) \subseteq K_i$ the set of currently processed and untouched keys
at time $t$ (light gray and white rectangles in Figure 4). The exact
remaining running time $r_i(t)$ for task $i$ is now given by:

$$r_i(t) = \sum_{k \in K_i(t)} f(k, V_k) \qquad (4)$$

where $f(k, V_k)$ is a cost model for the reduce running time,
depending both on the input key $k$ and on the input values $V_k$. Since
the true cost model $f$ is typically unknown, we need to estimate
$f(k, V_k)$. Our approach is to learn from past executions of the
reduce functions: since we know exactly $f(k, V_k)$ for each fully
processed key $k \in K_i \setminus K_i(t)$, we exploit this knowledge to
predict the running time of the unprocessed keys in $K_i(t)$.

With a slight abuse of notation, we will use $|V_k|$ to denote the
size (in bytes) of the key group $\langle k, V_k \rangle$. Guided by
asymptotic analysis and by previous works on performance profiling (see,
e.g., [?]), we assume that the running time of a reduce function
depends on the input size, and not on the actual input values: thus
$f(k_1, V_{k_1}) \approx f(k_2, V_{k_2})$ whenever
$|V_{k_1}| \approx |V_{k_2}|$. With this
assumption, an estimate $\tilde{r}_i(t)$ for the remaining running time
of reduce task $i$ can be obtained as:

$$\tilde{r}_i(t) = \sum_{k \in K_i(t)} \tilde{f}(|V_k|) \qquad (5)$$

where $\tilde{f}(|V_k|)$ is an estimate of $f(k, V_k)$, depending on the
input size. In Section 3.5 we discuss how to compute $\tilde{f}(|V_k|)$.
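
Equations 3 and 5 can be read as a small computation over the sizes of the key groups still to be processed. The sketch below assumes a hypothetical per-group predictor passed in as a function; the linear toy predictor and all numbers are invented for illustration only.

```python
def remaining_time_estimate(unprocessed_sizes, predict):
    """Equation 5: sum of predicted running times over the key groups that
    are not fully processed (including the one currently running)."""
    return sum(predict(size) for size in unprocessed_sizes)


def task_end_estimate(last_profile_update, unprocessed_sizes, predict):
    """Equation 3: last profile update time plus estimated remaining time."""
    return last_profile_update + remaining_time_estimate(unprocessed_sizes, predict)


def toy_predictor(size_bytes):
    """Placeholder predictor (an assumption of this sketch): 2 ms per byte."""
    return 0.002 * size_bytes


# Last profile update at 120 s; three key groups left, sizes in bytes.
end = task_end_estimate(120.0, [1000, 2000, 50_000], toy_predictor)
print(f"estimated end time: {end:.1f} s")
```

Any predictor with the same signature can be plugged in, which is where the regression techniques of Section 3.5 come into play.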


3.5 Predicting the running time $\tilde{f}(|V_k|)$ of reduce functions

We use regression analysis to predict the running time of a reduce
function on key group $\langle k, V_k \rangle$, using $|V_k|$ as
independent variable. In particular, we combine two well-known
techniques: $\delta$-nearest neighbor regression and curve fitting. Each
technique suits a different scenario: as our experiments show, a
careful integration yields very accurate progress estimates, even in
highly skewed settings.

Nearest-neighbor regression. This is a kind of instance-based
learning, where data is classified according to the training examples
that are closest to the new point [?]. In our setting, the training
examples are the fully processed keys $k' \in K_i \setminus K_i(t)$, for
which the exact running time $f(k', V_{k'})$ is known.

Given an unprocessed key $k \in K_i(t)$, we can predict $f(k, V_k)$
as follows. Let $\delta > 0$ be a constant and let
$N_i(t, k) \subseteq K_i \setminus K_i(t)$
be the $\delta$-neighborhood of $k$: $N_i(t, k)$ contains those keys
$k'$ that are fully processed at time $t$ and whose input size is
$\delta$-close to that of $k$
(formally, $|V_{k'}| \in [|V_k| - \delta, |V_k| + \delta]$). Different
rules can be used to derive $\tilde{f}(|V_k|)$ from $N_i(t, k)$. The
approach we take is among the simplest ones and is to average the
running times observed for the $\delta$-neighborhood:



$$\tilde{f}(|V_k|) = \frac{\sum_{k' \in N_i(t, k)} f(k', V_{k'})}{|N_i(t, k)|} \qquad (6)$$

Figure 5. Profile data used in the prediction of $\tilde{f}(|V_k|)$:
$D_i(t)$, $\delta$-neighborhood $N_i(t, k)$, and a curve fitting model.


Nearest-neighbor regression is well-defined and yields an estimate
for $\tilde{f}(|V_k|)$ only if $N_i(t, k)$ is non-empty. In some cases,
however, it may be necessary to perform significant extrapolations
during the reduce phase, i.e., to predict the running times well beyond
the range of the already observed input sizes. This is typically the
case in the presence of data skewness, where a few reduce instances
get inputs much larger than the average input size: these straggling
instances prominently affect the task ending time, but are likely to
have an empty neighborhood.
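
A minimal Python sketch of the averaging rule of Equation 6 follows. The profile is represented as a list of (input size, running time) pairs for fully processed keys; this representation, the function name, and the sample data are assumptions of the example.

```python
def nearest_neighbor_estimate(size, profile, delta):
    """Average the running times of fully processed key groups whose input
    size lies in [size - delta, size + delta]. Returns None when the
    neighborhood is empty, i.e., when the estimate is not defined."""
    neighbors = [rt for s, rt in profile if abs(s - size) <= delta]
    if not neighbors:
        return None
    return sum(neighbors) / len(neighbors)


# Invented profile: (input size in bytes, running time in seconds).
profile = [(100, 1.0), (110, 1.2), (95, 0.9), (500, 20.0)]

near = nearest_neighbor_estimate(105, profile, delta=10)    # three neighbors
far = nearest_neighbor_estimate(10_000, profile, delta=10)  # empty neighborhood
print(near, far)
```

The second call mirrors the straggler case discussed above: a key group much larger than anything observed so far has no neighbors, so another technique is needed.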


Curve fitting. Statistical curve fitting addresses the construction
of a mathematical model that has the best fit to a set of data
points [5]. In our setting, the set $D_i(t)$ of data points used to
build the model at time $t$ has two dimensions, describing respectively
input sizes and running times of terminated executions of the reduce
function:

$$D_i(t) = \bigcup_{k' \in K_i \setminus K_i(t)} \{ (|V_{k'}|, f(k', V_{k'})) \} \qquad (7)$$

Figure 5 shows the relation between $D_i(t)$ and the
$\delta$-neighborhood $N_i(t, k)$ of an unprocessed key $k$. As a
fitting model, we use $a + b \cdot x^c$, which generalizes both power
law and linear models (where $a = 0$ and $c = 1$, respectively). As
shown in previous works [?], this model can characterize different
kinds of real-world benchmarks. Curve fitting yields a closed form
expression that approximately describes the running time of a reduce
function on input size $|V_k|$, yielding clues to its growth rate. It
will predict $\tilde{f}(|V_k|)$ as:

$$\tilde{f}(|V_k|) = a + b \cdot |V_k|^c \qquad (8)$$

Curve fitting is quite appealing, but can be difficult to tune in
practice due to data noise, often producing low-quality cost models
(with small $R^2$ values). Moreover, it can be unstable when
repeatedly used over time, as in progress analysis: even small changes
in the model coefficients $a$, $b$, and $c$ can unfortunately yield
quite different predictions.
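
To make the fitting model concrete, here is a small pure-Python sketch that fits $a + b \cdot x^c$ and reports $R^2$. The fitting procedure used here (a grid search over the exponent combined with ordinary least squares for $a$ and $b$) is a simplification chosen for this example; the text does not specify which fitting procedure NearestFit uses. Input sizes are assumed to be positive.

```python
def fit_power_model(points, exponents=None):
    """Fit y ~ a + b * x**c on (x, y) points. For a fixed exponent c the
    model is linear in (a, b), so least squares has a closed form; the
    candidate c with the highest R^2 is kept. Returns (a, b, c, r2),
    or None if no candidate exponent yields a usable fit."""
    if exponents is None:
        exponents = [i / 10 for i in range(1, 41)]  # candidates 0.1 .. 4.0
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    n = len(points)
    mean_y = sum(ys) / n
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    best = None
    for c in exponents:
        zs = [x ** c for x in xs]
        mean_z = sum(zs) / n
        var_z = sum((z - mean_z) ** 2 for z in zs)
        if var_z == 0:
            continue
        b = sum((z - mean_z) * (y - mean_y) for z, y in zip(zs, ys)) / var_z
        a = mean_y - b * mean_z
        ss_res = sum((y - (a + b * z)) ** 2 for z, y in zip(zs, ys))
        r2 = 1 - ss_res / ss_tot if ss_tot > 0 else 0.0
        if best is None or r2 > best[3]:
            best = (a, b, c, r2)
    return best


# Invented profile generated by a quadratic cost: y = 0.5 + 1e-4 * x^2.
points = [(x, 0.5 + 1e-4 * x ** 2) for x in (10, 20, 50, 100, 200)]
a, b, c, r2 = fit_power_model(points)
prediction = a + b * 1000 ** c  # extrapolation beyond the observed sizes
print(a, b, c, r2, prediction)
```

On noise-free data the fit recovers the generating model and extrapolates well; with noisy profiles the coefficients can shift between updates, which is the instability mentioned above.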


NearestFit: combining nearest neighbors with curve fitting. We
observed that nearest-neighbor regression is typically very
accurate, but cannot always be applied. Conversely, curve fitting is
potentially always applicable, but its accuracy crucially depends on
the fitting quality. We therefore combined the two techniques so as
to overcome their drawbacks while retaining their advantages. To
predict $\tilde{f}(|V_k|)$, the reduce task $i$ considers the following
options in order:

1. checks if $N_i(t, k)$ is non-empty: in this case uses the
nearest-neighbor prediction according to Equation 6;

2. computes a model $a + b \cdot x^c$ on set $D_i(t)$ specified in
Equation 7: if the model has good fitting quality ($R^2 > 0.9$), uses
the prediction given by the model;

3. uses nearest-neighbor regression, if possible, on data points 
collected from reduce executions in all reduce tasks; 

4. among the curve fitting models available for all reduce tasks, 
predicts according to the model that best estimates the data 
points Di{t) currently available at task i. 

Notice that the first two steps use task-local information, while the 
latter ones exploit information gathered from all the other tasks. 
Several other combination strategies could be considered. In our 
experiments, this approach provided very reasonable tradeoffs
between accuracy and efficiency. Indeed, it resorts to curve fitting -
and ultimately to global information gathered from other tasks - 
only in very rare cases. Dealing with these cases is however crucial 
for an accurate progress estimate, since they often correspond to 
the most time-demanding (and difficult to predict) executions. 
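
The four options can be sketched as a fallback cascade. In the Python sketch below, models are represented as tuples (a, b, c, r2) and profiles as lists of (size, time) pairs; these representations, the function names, and the use of squared error to decide which global model "best estimates" the local data points are assumptions of this example, not details given in the text.

```python
R2_THRESHOLD = 0.9  # fitting-quality threshold quoted in the text


def nn_estimate(size, points, delta):
    """Mean running time over the delta-neighborhood of `size`, or None if empty."""
    times = [rt for s, rt in points if abs(s - size) <= delta]
    return sum(times) / len(times) if times else None


def evaluate(model, size):
    """Evaluate a fitted model (a, b, c, r2) at the given input size."""
    a, b, c, _r2 = model
    return a + b * size ** c


def predict(size, local_points, local_model, global_points, global_models, delta):
    """Return (estimate, option_used), trying the four options in order."""
    est = nn_estimate(size, local_points, delta)           # option 1
    if est is not None:
        return est, 1
    if local_model is not None and local_model[3] > R2_THRESHOLD:
        return evaluate(local_model, size), 2              # option 2
    est = nn_estimate(size, global_points, delta)          # option 3
    if est is not None:
        return est, 3
    if not global_models:
        return None, 4
    # Option 4. Assumption: "best" means smallest squared error on local points.
    best = min(global_models,
               key=lambda m: sum((rt - evaluate(m, s)) ** 2 for s, rt in local_points))
    return evaluate(best, size), 4


# Invented data for illustration.
local_points = [(100, 1.0), (120, 1.44)]
poor_local_model = (0.0, 0.01, 1.0, 0.5)     # low R^2: not trusted
global_points = [(100, 1.0), (5000, 260.0)]
global_models = [(0.0, 0.01, 1.0, 0.95), (0.0, 1e-4, 2.0, 0.97)]

args = (local_points, poor_local_model, global_points, global_models, 15)
print(predict(110, *args), predict(5005, *args), predict(20_000, *args))
```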

3.6 Generalizations 

Map phase. The approach described for the reduce phase can
also be extended to the map phase. From a high level perspective,
NearestFit exploits two main ingredients: the running times of
past function executions and a characterization of the input.
Running times can be collected for map functions similarly to the
reduce phase. On the other hand, the main challenge is how to
compute the set $K_j$ of $\langle key, value \rangle$ pairs assigned to a
map task $j$: while $K_i$, for each reduce task $i$, can be obtained by
profiling the map phase (as we will see later), we have no such
information about map input before actually running the map tasks.
Scanning the entire job input before the execution would be prohibitive.
Hence, similarly to previous works [19, 29], we can exploit
sampling to approximate this information.

In practice, stragglers are unlikely in the map phase: splits
assigned to a map task have a limited size, and the input of each map
function is simply a single $\langle key, value \rangle$ pair, instead of
the possibly long list of values received by reduce functions. Hence, in
practice a linear prediction model is good enough to estimate progress
of the map phase. This is the choice in our implementation, which
proved to be fully satisfactory in our experiments.

As a final note, the output of map functions could be locally 
aggregated through the use of combiners, which however operate 
on a limited amount of data: the output of map functions is buffered, 
and buffers typically have a small size. Moreover, combiners run 
in parallel with the map functions, as a separate thread during the 
map phase. Hence, we do not explicitly predict the running time of 
combiners, which is covered by the map phase progress. 

Shuffle phase. As observed in previous works, the shuffling
phase is application-independent and its performance can be
safely estimated using only execution profiles collected on
microbenchmarks. Microbenchmark predictions need to be scaled
proportionally to the amount of shuffle data, which can be obtained by
profiling map tasks.

Removing the single-wave assumption. So far, we have assumed
that all reduce tasks immediately start after shuffling, i.e., at time
$t_{start}$. In general, this is not true if the task parallelism available
in the cluster is smaller than the number of reduce tasks. In order
to consider multiple execution waves in our prediction model, we
need to refine the equation that estimates the ending time $e_i(t)$ of
a reduce task $i$. At any time $t$, we distinguish between unscheduled
and running tasks (the ending time for terminated tasks is already
known).


For running tasks, everything is estimated as in Section 3.3.
However, before receiving the first profile from task $i$, we take care
of initializing $p_i(t)$ to the task starting time (instead of $t_{start}$).

For unscheduled tasks, we first estimate their remaining time
$r_i(t)$, including the running time required for shuffling pairs (this
is done similarly to the shuffle phase). Since local task information
is not yet available, we exploit information collected from the other
tasks, predicting running times according to points 3 and 4 of the
combined approach described in Section 3.5 (in point 4, we choose
the curve fitting model with the best quality $R^2$). Once $r_i(t)$ is
available for all tasks, we simulate the scheduler to deduce the
starting time for each unscheduled task. We thus obtain an estimate
of $p_i(t)$. We remark that reasoning on the scheduler choices is
a standard practice in previous performance analysis papers, which
typically assume a simple online greedy scheduling strategy as in
this paper.
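
The scheduler simulation for unscheduled tasks can be sketched as follows. This is only an illustration under the assumption of a simple online greedy strategy in which each unscheduled task, taken in queue order, starts on the slot that becomes free first. The function `simulate_start_times` and its parameters are hypothetical names, not part of NearestFit or Hadoop.

```python
import heapq


def simulate_start_times(running_end_times, unscheduled_remaining,
                         free_slots=0, now=0.0):
    """Greedy list scheduling of unscheduled reduce tasks.

    running_end_times: estimated ending times e_i(t) of the running tasks.
    unscheduled_remaining: estimated remaining times r_i(t) of the
        unscheduled tasks, in the order the scheduler would pick them.
    free_slots: number of slots that are idle at time `now`.
    Returns one (start, end) estimate per unscheduled task.
    """
    # Each heap entry is the time at which one slot becomes available.
    slots = list(running_end_times) + [now] * free_slots
    if not slots:
        raise ValueError("at least one slot is required")
    heapq.heapify(slots)

    schedule = []
    for remaining in unscheduled_remaining:
        # The next task starts as soon as the earliest slot is released.
        start = max(heapq.heappop(slots), now)
        end = start + remaining
        heapq.heappush(slots, end)
        schedule.append((start, end))
    return schedule
```

For example, with two running tasks expected to end at times 10 and 4, three unscheduled tasks with remaining times 5, 3, and 2 are placed one after another on whichever slot frees up first.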

4. An operational view of NearestFit 

To estimate progress, NearestFit needs to collect diverse
information, combining coarse- and fine-grained profiles. In this section
we discuss the orchestration of NearestFit profiling components,
which work in different job phases and exploit both distributed and
centralized computations. We remark that gathering fine-grained
profile data can result in very large time and space overheads,
being thus infeasible in practice. For the time being we ignore this
issue, describing a simple-minded operational view of the theoretical
model presented in Section 3. We defer to Section 5 the discussion
of algorithmic techniques aimed at making the model practical: this
will be achieved by computing suitable approximations for some of
the quantities explicitly maintained throughout this section.

4.1 Gathering profiles 

NearestFit uses three different profiles: (1) map task profiles
collected during the map phase by each map task; (2) a key distribution
profile computed by the application master from the map task
profiles; and (3) reduce task profiles collected during the reduce phase
by each reduce task.

Map task profiles. Each key $k$ emitted by a map task $j$ may be
produced, associated with different values, by distinct map function
invocations within task $j$. Let $V_{k,j}$ be the set of values associated
to $k$ by task $j$. The map task profile maintains all pairs $(k, |V_{k,j}|)$.
The profile is sent to the application master upon termination of the
map task.

Key distribution profile. For each key $k$ emitted during the map
phase, the application master is aware of which reduce task will be
responsible for processing $k$. By reversing this information, it can
therefore obtain the set $K_i$ of keys assigned to each reduce task $i$.
For each $k \in K_i$, we also need to know the size $|V_k|$ of its key
group in order to compute the remaining time $r_i(t)$ of the task.
$|V_k|$, however, is unknown before fetching and merging the sets
$V_{k,j}$ stored in the local file systems of each map worker $j$. The
map task profiles are thus put to use to compute $|V_k|$:

$$|V_k| = \sum_{\text{map tasks } j} |V_{k,j}| \qquad (9)$$

Overall, the key distribution profile will contain, for each reduce
task $i$, all the keys $k \in K_i$ along with their value sizes $|V_k|$.
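
As an illustration of how the application master could derive the key distribution profile from the map task profiles, the sketch below sums the per-task sizes of each key (Equation 9) and groups keys by reduce task. The function name and the `reducer_of` callback, which stands in for the job partitioner, are assumptions made for the example.

```python
from collections import defaultdict


def key_distribution_profile(map_task_profiles, reducer_of):
    """Merge map task profiles into a key distribution profile.

    map_task_profiles: iterable of dicts {key: |V_{k,j}|}, one per map task j.
    reducer_of: function mapping a key to the index i of the reduce task
        that will process it (hypothetical stand-in for the partitioner).
    Returns {i: {k: |V_k|}} for every reduce task i that receives keys.
    """
    profile = defaultdict(lambda: defaultdict(int))
    for task_profile in map_task_profiles:
        for key, size in task_profile.items():
            # Equation 9: |V_k| is the sum of |V_{k,j}| over all map tasks j.
            profile[reducer_of(key)][key] += size
    return {i: dict(keys) for i, keys in profile.items()}
```

Note that this fine-grained version keeps one entry per distinct key, which is exactly the cost that a practical implementation has to avoid.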

Reduce task profiles. Reduce task profiles are collected as
described in Section 3.3. Profile data gathered at time $p_i(t)$ contain
information on the reduce function execution just completed in
reduce task $i$. If $\langle k, V_k \rangle$ is the key group processed by that reduce
function, the profile reports the key $k$, the input size $|V_k|$, and the
running time $f(k, V_k)$.

4.2 Application master data structures

During the reduce phase, for each reduce task $i$ the application
master maintains:

• the last reduce profile update time $p_i(t)$ - see Section 3.3

• the set of unprocessed keys $K_i(t)$ - see Section 3.4

• the past executions data points $D_i(t)$ - see Section 3.5

At time $t_{start}$, we have $K_i(t) = K_i$, $p_i(t) = t_{start}$, and
$D_i(t) = \emptyset$. $K_i$ is obtained from the key distribution profile. When the
reduce function for key group $\langle k, V_k \rangle$ terminates at task $i$, the
application master updates $p_i(t)$ to the termination time, removes $k$
from $K_i(t)$, and adds a new point $(|V_k|, f(k, V_k))$ to $D_i(t)$.
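
The bookkeeping performed by the application master for one reduce task can be sketched as a small record type. The class and method names below are hypothetical; the sketch only mirrors the three quantities listed above and the update applied when a reduce function terminates.

```python
from dataclasses import dataclass, field


@dataclass
class ReduceTaskState:
    """Per-task state kept by the application master (illustrative only)."""
    unprocessed: dict      # K_i(t): key -> |V_k|, from the key distribution profile
    last_update: float     # p_i(t), initialised to t_start
    data_points: list = field(default_factory=list)  # D_i(t)

    def on_reduce_completed(self, key, running_time, finished_at):
        """Apply the update triggered by the termination of one reduce function."""
        size = self.unprocessed.pop(key)               # remove k from K_i(t)
        self.last_update = finished_at                 # p_i(t) <- termination time
        self.data_points.append((size, running_time))  # add (|V_k|, f(k, V_k))
```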


4.3 Updating progress estimates 

At any time $t$, the progress indicator can be brought up to date
by first computing $f(|V_k|)$ for each unprocessed key $k$ as follows.
The curve fitting model parameters of each reduce task are updated
using sets $D_i(t)$. Moreover, the $\delta$-neighborhood $N_i(t, k)$ of keys
$k \in K_i(t)$ is obtained from $|V_k|$ and $D_i(t)$, and
Equation 6 is used if the $\delta$-neighborhood is non-empty. Once
estimates $f(|V_k|)$ are available, the progress can be determined
according to the equations of the progress model presented in Section 3.


5. Making NearestFit practical 

As observed in Section 4, we need to minimize the amount of
profile data that percolates through the framework - and ultimately
through the network - both for time and for space efficiency
reasons. In practice, we cannot afford to collect on worker nodes
fine-grained profiles that are later processed by the application master in
a centralized way. The key insight to solve this issue is to get rid of
keys in our profiles. As shown by the equation for the remaining time
of a reduce task, only input sizes $|V_k|$
are needed to predict the remaining times: keys are not used except
for defining the terms of the sum, which iterates over $k \in K_i(t)$.
In this section we show that we can avoid maintaining $K_i(t)$
explicitly, revisiting the operational view of NearestFit presented
in Section 4. Getting rid of keys presents challenges at different
points, but also offers several benefits: the set of distinct keys in
a MapReduce job is typically huge, while the set of distinct input
sizes is much smaller and tractable (many key groups have similar,
if not identical sizes, and can be aggregated).

5.1 Gathering key-independent profiles 


Map task profiles. A map task $j$ splits its emitted keys into two
logical categories: explicit and implicit keys. We recall that $V_{k,j}$ is
the set of values associated to key $k$ by task $j$. Pairs $(k, |V_{k,j}|)$ are
maintained in the profile only for explicit keys, whose number is
fixed to a constant $\lambda$ (in our implementation $\lambda = 2000$). Explicit
keys are the $\lambda$ keys with the largest sizes $|V_{k,j}|$. This set can be
obtained at the end of the local shuffle phase of a map task.
Intuitively, we focus on explicit keys because large numbers of values
can result in high reduce running times. Slow reduce instances, in
turn, are likely to have non-negligible effects on the job progress.

The remaining (implicit) keys are not reported in the map task
profile, except for aggregate values. Let $K_{i,j}$ be the set of keys
emitted by map task $j$ and assigned to reduce task $i$, and let $E_{i,j}$
be the maximal subset of $K_{i,j}$ containing only explicit keys. Then,
for each reduce task $i$, the map task profile contains the pair:

$$\Big( |K_{i,j}| - |E_{i,j}|, \; \sum_{k \in K_{i,j} \setminus E_{i,j}} |V_{k,j}| \Big) \qquad (10)$$

This gives the number of implicit keys emitted by map task $j$ and
assigned to reduce task $i$ and their total size.

Key distribution profile. Since we no longer have information on
all keys, we cannot compute $K_i$ explicitly. We operate differently
on explicit and implicit keys.

For implicit keys assigned to reduce task $i$, we can compute their
total size by summing up the values $\sum_{k \in K_{i,j} \setminus E_{i,j}} |V_{k,j}|$ received from all map
tasks $j$. Instead, we cannot know their number, since summing up
the numbers $|K_{i,j}| - |E_{i,j}|$ received from map tasks $j$ yields a very
inaccurate upper bound: some of the map tasks might emit the same
key, which would be counted more than once in the sum. We will later
show how to get a safe estimate using information collected during
the reduce phase (see Section 5.3).

We now consider explicit keys received by reduce task $i$. For
each $k \in E_{i,j}$, we know from the map task profiles its number of
values $|V_{k,j}|$ emitted by map task $j$. We thus estimate:

$$|\tilde{V}_k| = \sum_{\text{map tasks } j \,:\, k \in E_{i,j}} |V_{k,j}| \qquad (11)$$

This is not the exact value size of key $k$: $|\tilde{V}_k| \le |V_k|$ since the
same key can be explicit in a map task $j$ and implicit in a different
map task. Map tasks for which $k$ is implicit do not contribute to the
sum. However, in the presence of data skewness, we expect $|V_k|$ to
be large with respect to the majority of value sizes of other keys,
and therefore it is likely that $|\tilde{V}_k| \approx |V_k|$.

To bound memory consumption of the application master, we
introduce an additional approximation when computing the sum in
Equation 11. The master must merge the explicit keys emitted by
all map tasks in order to estimate $|\tilde{V}_k|$ according to Equation 11.
Although the number $\lambda$ of explicit keys emitted by each map task
is bounded, a large number of map tasks might yield a prohibitively
large set of explicit keys on the master side. For instance, if there
are 50000 map tasks and each of them emits 2000 explicit keys,
the master would receive up to 100 million pairs. To overcome this issue,
we aggregate the sizes of the smallest explicit keys in the union
set, making them implicit. This is done through a space-efficient
algorithm that maintains frequent items over data streams [23],
where the sets of explicit keys received from each map task are
regarded as streams. The Space Saving algorithm [23] processes
streams on-the-fly, without storing them entirely, and can return the
heaviest keys - as well as an estimate of their sizes - with very high
accuracy.

Reduce task profiles. Reduce task profiles are computed as
described in Section 4.1, but omitting the input key: for each key
group $\langle k, V_k \rangle$ processed by a reduce function, the profile reports
only $|V_k|$ and the running time required to process the group.
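
To make the merging of explicit keys on the master side more concrete, here is a minimal sketch of a weighted variant of the Space Saving algorithm. It follows the textbook scheme (a newcomer replaces the currently smallest entry and inherits its counter), not necessarily the exact variant used by NearestFit; the class name, the `capacity` parameter, and the linear-time eviction are simplifying assumptions.

```python
class SpaceSaving:
    """Weighted Space Saving summary keeping at most `capacity` keys."""

    def __init__(self, capacity):
        if capacity < 1:
            raise ValueError("capacity must be positive")
        self.capacity = capacity
        self.sizes = {}  # key -> estimated total size (never an underestimate)

    def offer(self, key, size):
        """Process one (key, size) pair from a stream of explicit keys."""
        if key in self.sizes:
            self.sizes[key] += size
        elif len(self.sizes) < self.capacity:
            self.sizes[key] = size
        else:
            # Evict the smallest entry; the newcomer inherits its counter.
            # A linear scan keeps the sketch short; real implementations
            # use a dedicated structure to find the minimum in O(1).
            victim = min(self.sizes, key=self.sizes.get)
            self.sizes[key] = self.sizes.pop(victim) + size

    def heaviest(self, n):
        """Return the n keys with the largest estimated sizes."""
        ranked = sorted(self.sizes.items(), key=lambda kv: kv[1], reverse=True)
        return ranked[:n]
```

Feeding the explicit pairs of every map task profile to `offer` keeps the master memory bounded by `capacity` entries, at the price of overestimating the sizes of keys that entered the summary by eviction.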


5.2 Application master data structures 

With respect to the data structures used in Section 4.2, we can no
longer maintain the set $K_i(t)$ of keys not yet processed by reduce
task $i$ at time $t$. Indeed, $K_i(t)$ was initialised to $K_i$, which is now
unknown to the application master except for explicit keys. Instead
of $K_i(t)$, at time $t$ we now maintain:

• a set $S_i(t)$ of approximate sizes of unprocessed explicit keys;

• the total size $s(t)$ of unprocessed implicit keys.

Both $S_i(t)$ and $s(t)$ can be initialised with information available
in the key distribution profile (recall that the approximate sizes of
explicit keys have been computed in Equation 11).

When the reduce function for key group $\langle k, V_k \rangle$ terminates at
reduce task $i$, the application master receives $(|V_k|, f(k, V_k))$ in
the reduce task profiles. $S_i(t)$ and $s(t)$ are updated as follows:

• if $|V_k|$ is (close to) an explicit size $|\tilde{V}_k|$ maintained in $S_i(t)$, we
delete $|\tilde{V}_k|$ from $S_i(t)$;

• otherwise we decrease $s(t)$ by $|V_k|$.

$D_i(t)$ and $p_i(t)$ can be updated as before (see Section 4.2). We
remark that all the data structures of the application master are now
independent of keys, and use only the sizes of the key groups.
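
A possible realization of the key-independent update is sketched below. The text does not quantify what "close to" means, so the relative `tolerance` used here is purely an assumption, as are the class and method names.

```python
import bisect


class SizeOnlyTaskState:
    """Key-independent state of one reduce task (illustrative sketch)."""

    def __init__(self, explicit_sizes, implicit_total, tolerance=0.05):
        self.explicit = sorted(explicit_sizes)  # S_i(t), approximate sizes
        self.implicit_total = implicit_total    # s(t)
        self.tolerance = tolerance              # assumed meaning of "close to"

    def on_reduce_completed(self, size):
        """Update S_i(t) or s(t) given the size |V_k| of a processed group."""
        pos = bisect.bisect_left(self.explicit, size)
        # Only the two neighbours of the insertion point can be the closest.
        candidates = [p for p in (pos - 1, pos) if 0 <= p < len(self.explicit)]
        if candidates:
            best = min(candidates, key=lambda p: abs(self.explicit[p] - size))
            if abs(self.explicit[best] - size) <= self.tolerance * size:
                del self.explicit[best]
                return "explicit"
        self.implicit_total = max(0, self.implicit_total - size)
        return "implicit"
```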

5.3 Updating progress estimates 

We now discuss how to update the progress indicator using the data
structures described in Section 5.2. The main issue is to compute
the remaining time of each reduce task $i$: the corresponding equation
iterates over the set $K_i(t)$ of unprocessed keys, which is however no
longer available. This issue can be naturally solved for explicit keys,
whose approximate sizes are explicitly maintained in $S_i(t)$. The
remaining time for explicit keys can be computed as:

$$r_i^{E}(t) = \sum_{|\tilde{V}_k| \in S_i(t)} f(|\tilde{V}_k|) \qquad (12)$$

where $f$ is obtained as before either via nearest-neighbor regression
or via statistical curve fitting.

Computing the remaining time for implicit keys is more
challenging, since we know only their total size $s(t)$. Computing the
average size for implicit keys turns out to be impossible, as the
number of implicit keys is unknown (see Section 5.1). With no
information on individual sizes $|V_k|$ and on the number of distinct
implicit keys assigned to a reduce task, we cannot decide which
and even how many terms contribute to the summation in the
remaining time equation. We approximate the missing information by
learning the key size distribution during the execution of reduce functions.
Namely, we maintain an approximate histogram of the most
frequent sizes encountered during the reduce phase and we partition
$s(t)$ based on this distribution. In this way we obtain estimates of
the number of implicit keys and of their sizes, which we plug into
the same equation to obtain the remaining time for implicit keys.
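
The text does not spell out how $s(t)$ is partitioned, so the following sketch shows just one plausible reading: the histogram of observed sizes gives the relative frequency of each size, the number of unprocessed implicit keys is estimated as $s(t)$ divided by the observed mean size, and the predicted time of each size is weighted accordingly. The function name, the histogram representation, and the `predict` callback (standing for the estimate $f$) are assumptions.

```python
def implicit_remaining_time(size_histogram, implicit_total, predict):
    """Estimate the remaining time of the unprocessed implicit keys.

    size_histogram: dict {size: number of reduce executions observed so far}.
    implicit_total: s(t), total size of the unprocessed implicit keys.
    predict: function size -> predicted running time of a reduce function.
    """
    total_count = sum(size_histogram.values())
    total_size = sum(size * count for size, count in size_histogram.items())
    if total_count == 0 or total_size == 0:
        return 0.0  # nothing learned yet about the size distribution

    mean_size = total_size / total_count
    estimated_keys = implicit_total / mean_size  # estimated number of keys
    return sum(
        estimated_keys * (count / total_count) * predict(size)
        for size, count in size_histogram.items()
    )
```

With a histogram {1: 3, 2: 1} and $s(t) = 10$, the sketch estimates 8 remaining keys (6 of size 1 and 2 of size 2), whose sizes indeed add up to 10.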

Notice that throughout the entire computation we never use
keys (neither explicit nor implicit), but limit ourselves to exploiting
different - explicit or aggregate - approximations of the key group sizes.

6. Hadoop implementation 

We implemented NearestFit on top of Hadoop 2.6.0. Apache
Hadoop is perhaps the most widespread open-source
implementation of MapReduce. The implementation consists of three main
components: map task tracker, reduce task tracker, and progress
monitor. The map task tracker is deployed on map worker nodes
and generates a map task profile once and for all during the local
shuffling phase. Similarly, the reduce task tracker generates a
reduce task profile during the execution of each reduce task. Reduce
profiles are updated during the computation, and updates are
periodically sent to the progress monitor. However, differently from
the theoretical description of Section 3.3, to reduce communication
overhead the reduce task tracker uses a buffering strategy, delaying
updates until the termination of a group of reduce functions. The
progress monitor fetches map and reduce profiles, updating its data
structures. Notice that the progress monitor is a centralized
component. For performance reasons, it is deployed as a service inside the
same application master launched by Hadoop for handling the job
execution. We now discuss some relevant aspects and optimizations
of our implementation.

Measuring value size. In our model we repeatedly used the size
(in bytes) required for storing ⟨key, value⟩ pairs. Since keys and
values have to be (de)serialized by Hadoop, our implementation
can efficiently collect their size, incurring a negligible overhead.

Key hashing. The key emitted by a map function in Hadoop can
be any kind of user-defined Java class object. To avoid high
processing and communication costs due to large keys, our
implementation considers a 64-bit hash value computed on the serialized
representation of the key, instead of its actual object representation.
This is more space-efficient and incurs no major drawback (e.g.,
hash conflicts) in practice.
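
The idea of replacing keys by fixed-size hashes can be illustrated as follows. The hash function actually used by the implementation is not stated, so BLAKE2b truncated to 8 bytes is an arbitrary choice made only for this sketch, and `key_hash64` is a hypothetical helper.

```python
import hashlib


def key_hash64(serialized_key: bytes) -> int:
    """Map the serialized form of a key to a 64-bit unsigned integer."""
    # BLAKE2b with an 8-byte digest is an assumption of this example.
    digest = hashlib.blake2b(serialized_key, digest_size=8).digest()
    return int.from_bytes(digest, "big")
```

Whatever the size of the original key object, the profile then stores and ships exactly 8 bytes per key.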

Noise and communication reduction via smoothing. Working
on sizes instead of keys, as described in Section 5, makes it possible
to perform smoothing, i.e., to aggregate information about similar
executions, characterized by similar running times and similar
input sizes. In our implementation we aggregate data points of $D_i(t)$
by merging past executions of reduce functions with the same input
size and whose running times differ by at most 500 ms.
Smoothing turns out to be crucial to bound the amount of profile data sent
to the progress monitor and also to reduce data noise, which could
easily harm curve fitting algorithms.
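
A simple way to realize this kind of smoothing is sketched below: data points are grouped by input size, and within each group running times are merged greedily as long as they stay within 500 ms of the smallest time of the current cluster. Representing a merged cluster by its mean time and its multiplicity is an assumption of the example, as is the function name.

```python
from collections import defaultdict


def smooth(points, max_gap_ms=500.0):
    """Merge data points with equal input size and similar running times.

    points: iterable of (input_size, running_time_ms) pairs.
    Returns a list of (input_size, mean_running_time_ms, multiplicity).
    """
    by_size = defaultdict(list)
    for size, time_ms in points:
        by_size[size].append(time_ms)

    merged = []
    for size in sorted(by_size):
        times = sorted(by_size[size])
        cluster = [times[0]]
        for t in times[1:]:
            if t - cluster[0] <= max_gap_ms:
                cluster.append(t)  # within 500 ms of every cluster member
            else:
                merged.append((size, sum(cluster) / len(cluster), len(cluster)))
                cluster = [t]
        merged.append((size, sum(cluster) / len(cluster), len(cluster)))
    return merged
```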

Measuring running times. Measuring the running time of each
reduce function execution may be inaccurate for short runs and
cause a non-negligible slowdown. In our first implementation of
NearestFit, we experienced a slowdown up to 30% w.r.t. native
execution, most prominently on jobs characterized by many short
reduce runs. We have thus introduced a bursting strategy driven
by the key group sizes $|V_k|$. The main idea is to obtain
accurate measurements for key groups related to (heavy) explicit keys,
while avoiding frequent measurements for key groups related to
(lightweight) implicit keys. In more detail, our bursting strategy
delivers a time measurement only if one of the following
conditions is met:

• the size of the key group currently being processed is larger than 
an implementation-dependent threshold size; 

• the number of reduce executions that have been skipped (i.e., 
not explicitly measured) is larger than another implementation- 
dependent threshold. 

The first condition is checked during the invocation of the next 
operator, i.e., while iterating over the list of values passed to a 
reduce function. The second condition is checked after each reduce 
execution. Thresholds can be customized: in our experiments we 
used 50 bytes and 100 skipped executions, respectively. 

Let t be the cumulative time of a burst of consecutive reduce 
executions. At the end of the burst we need to assign t to the 
different executions: instead of partitioning the cumulative time 
uniformly, our approach is to split t proportionally to the input size 
of each invocation. In practice, we are using a linear model here, 
but only for executions that are very short. 
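
The bursting strategy can be summarized in a few lines. The two helper functions below are hypothetical; the default thresholds are the ones reported above (50 bytes and 100 skipped executions), and the uniform fallback for a burst whose inputs are all empty is an assumption added to avoid a division by zero.

```python
def should_measure(group_size, skipped, size_threshold=50, skip_threshold=100):
    """Decide whether a time measurement must be delivered."""
    return group_size > size_threshold or skipped > skip_threshold


def split_burst_time(total_time, input_sizes):
    """Split the cumulative time of a burst proportionally to input sizes."""
    if not input_sizes:
        return []
    total_size = sum(input_sizes)
    if total_size == 0:
        # Degenerate burst: fall back to a uniform split (assumption).
        return [total_time / len(input_sizes)] * len(input_sizes)
    return [total_time * size / total_size for size in input_sizes]
```

For instance, a burst of three executions with input sizes 1, 2, and 3 and a cumulative time of 12 ms is charged 2, 4, and 6 ms, respectively.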

7. Experimental setup 

In this section we describe the experimental framework that we 
used to evaluate NearestFit. We discuss the inner workings of 
state-of-the-art competitors, synthetic and real benchmarks used in 
our empirical assessment, performance metrics, and platform setup. 

7.1 State-of-the-art progress indicators 

We compared NearestFit against three different progress
indicators: one is the standard indicator provided by Hadoop, while
the last two exploit techniques presented in previous works [25, 33].
We implemented the latter techniques doing our best to
faithfully convey in the implementation their key ideas. The description
below is cast into our notation and focuses on the reduce phase.

Hadoop progress indicator. Progress estimates in Hadoop do not
take into account running times. The progress of the reduce phase is
given by the average progress of the reduce tasks, and the progress
of each task is simply the percentage of shuffle data read by the task
itself. Using the average can badly affect progress estimates in the
presence of load unbalancing.

Job ratio. The main idea in [25] and [33] is to compute an
average execution speed $\alpha(t)$ for the reduce functions across all the
reduce tasks. With the notation introduced in Section 3:

$$\alpha(t) = \frac{\sum_{\text{tasks } i} \sum_{k \in K_i \setminus K_i(t)} f(k, V_k)}{\sum_{\text{tasks } i} \sum_{k \in K_i \setminus K_i(t)} |V_k|} \qquad (13)$$

An exponentially weighted moving average (EWMA) could be
applied to $\alpha(t)$ in order to smooth fast variations over time [25].
The remaining time $r_i(t)$ for reduce task $i$ depends on $\alpha(t)$ as
follows:

$$r_i(t) = \alpha(t) \times \sum_{k \in K_i(t)} |V_k| \qquad (14)$$

Equations 13 and 14 can be efficiently computed by collecting
aggregate profile data at the task level. Job ratio correctly takes into
account load unbalancing among different reduce tasks. However,
by computing the average across all tasks, it implicitly assumes that
different tasks process input data at a similar speed.


Task ratio. This is a variant of job ratio that computes a distinct
average execution speed $\alpha_i(t)$ per task, with the goal of addressing
different behaviors of the reduce tasks. As long as the amount of
data processed by task $i$ is small, $\alpha_i(t) = \alpha(t)$. Otherwise:

$$\alpha_i(t) = \frac{\sum_{k \in K_i \setminus K_i(t)} f(k, V_k)}{\sum_{k \in K_i \setminus K_i(t)} |V_k|} \qquad (15)$$

Similarly to job ratio, an exponentially weighted moving average
can be applied to values $\alpha_i(t)$ to smooth fast variations. The
remaining time $r_i(t)$ for reduce task $i$ can be now predicted as:

$$r_i(t) = \alpha_i(t) \times \sum_{k \in K_i(t)} |V_k| \qquad (16)$$


This model takes into account both load unbalancing and different 
execution speeds of the reduce tasks. However, it still assumes that 
the running time of a reduce task scales linearly with respect to 
the size of its input data, ignoring any issues given by superlinear 
reduce computations. 
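
For reference, the two ratio-based competitors can be sketched in a few lines, without the optional EWMA smoothing. The function names are hypothetical, and the `min_processed` parameter stands for the unspecified threshold below which the amount of data processed by a task is considered small.

```python
def average_speed(points):
    """Time spent per unit of input over a list of (size, running_time)."""
    total_size = sum(size for size, _ in points)
    total_time = sum(time for _, time in points)
    return total_time / total_size if total_size else 0.0


def job_ratio(done, pending):
    """Remaining times with one speed shared by all tasks (Equations 13-14).

    done[i]: (|V_k|, f(k, V_k)) pairs of the key groups processed by task i.
    pending[i]: sizes |V_k| of the key groups still unprocessed at task i.
    """
    alpha = average_speed([p for task in done for p in task])
    return [alpha * sum(sizes) for sizes in pending]


def task_ratio(done, pending, min_processed=0):
    """Remaining times with a per-task speed (Equations 15-16)."""
    alpha = average_speed([p for task in done for p in task])
    remaining = []
    for points, sizes in zip(done, pending):
        processed = sum(size for size, _ in points)
        # Fall back to the job-level speed while little data was processed.
        alpha_i = average_speed(points) if processed > min_processed else alpha
        remaining.append(alpha_i * sum(sizes))
    return remaining
```

Both estimators multiply a speed by the pending input size, which is where the linearity assumption discussed above enters.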


7.2 A synthetic benchmark 

To perform a preliminary empirical assessment of the accuracy of
different progress indicators, we implemented a synthetic
MapReduce job where we can customize the running time of the reduce
function, choosing among $\Theta(n)$, $\Theta(n^2)$, $\Theta(n^3)$, or $\Theta(n^4)$ (linear,
quadratic, cubic, or quartic). Data sets for the synthetic benchmark
are generated using different skewness levels, where skewness is
specified by a number $\sigma > 1$: if $n$ is the largest input size of a
reduce function, we appropriately generate keys so that
approximately $\sigma^i$ functions have input size $n/\sigma^i$, for $i \ge 0$.
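
The skewed size distribution of the synthetic data sets can be illustrated with a short generator. The rounding of counts and sizes and the `min_size` stopping rule are assumptions of this sketch; the text only prescribes the approximate shape of the distribution.

```python
def skewed_group_sizes(n, sigma, min_size=1):
    """Key group sizes with about sigma**i groups of size n / sigma**i."""
    if sigma <= 1:
        raise ValueError("sigma must be greater than 1")
    sizes = []
    i = 0
    while n / sigma ** i >= min_size:
        count = round(sigma ** i)    # number of groups at level i
        size = round(n / sigma ** i)  # input size of each of them
        sizes.extend([size] * count)
        i += 1
    return sizes
```

With $n = 8$ and $\sigma = 2$ this yields one group of size 8, two of size 4, four of size 2, and eight of size 1: every level carries the same total amount of data, but the largest group dominates any superlinear reduce function.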


7.3 Real-world benchmarks 

We have carefully chosen a variety of real-world benchmarks,
whose main features are summarized in Table 2, trying to expose
different computational patterns typical of MapReduce
applications.

• WordCount is probably the most well-known MapReduce
example. It counts word frequencies inside a set of documents. After
tokenizing each document into separate lines during the map phase,
each reduce function invocation computes the frequency of a
different word. Both map and reduce functions have a linear complexity.
By default combiners are active. WordCount-NC is the variant in
which combiners are disabled. As input dataset for these
applications we used a 50GB archive of Wikipedia articles.


• InvertedIndex is taken from the PumaBenchmark suite. It
computes the inverted index of a set of documents. The map
function tokenizes lines of a document $d$, emitting for each word the
pair (word, $d$), in linear time. The reduce function is the identity
function, except for eliminating duplicates in the output. Due to
a sub-optimal choice of data structures for duplicate detection, the
reduce function has quadratic running time. Data sets have been
derived from the 50GB Wikipedia archive, arbitrarily partitioning
articles into either 5000 large documents or 50000 small documents.

• 2PathGenerator generates paths of length two in a graph.
This is a common step in many graph analytics applications. Our
code is taken from previously published work. Map functions
require constant time and emit a pair $(u, v)$ for each arc $(u, v)$ in
the graph. Reduce functions, given the neighborhood of a node $u$,
emit a quadratic number of length-2 paths centered at $u$. Data sets
used for this benchmark are six different social networks and web
graphs taken from the Stanford Network Analysis Project [20].

• TriangleCount implements round 2 of the NodeIterator
algorithm described in [30]. The reduce function receives a pair
of nodes $u$ and $v$ as key and a list of neighbors common to $u$ and $v$
as values. If arc $(u, v)$ exists, for each common neighbor $w$ it emits
a pair $(w, 1)$. This requires linear time. At the end of the round,
the number of emitted pairs is six times the number of triangles in
the graph (see [30] for details). We tested TriangleCount on the
SNAP graphs [20].

• NaturalJoin is a naive MapReduce implementation of the
natural join operator between two relations $R$ and $S$. Both $R$
and $S$ consist of tuples with two attributes. The map function, for
each tuple, emits the first attribute as key and the second attribute
as value. The reduce function joins values associated to the same key,
computing the Cartesian product between input tuples. If $n_R(k)$
and $n_S(k)$ denote the number of tuples with key $k$ in relations $R$
and $S$, respectively, the reduce function has complexity
$\Theta(n_R(k) \cdot n_S(k))$. Based on this observation, we tested NaturalJoin on five
different datasets.

1. The first three data sets are such that $S$ is not skewed (for
each $k$, $n_S(k) = O(1)$) while $R$ is generated following a Zipf
distribution [28] with skewness 1.0, 1.5, and 2.0, respectively;

2. Both $R$ and $S$ follow a Zipf distribution, with skewness 2.0 in
$R$ and 1.0 in $S$;

3. Both $R$ and $S$ are generated following a Zipf distribution with
skewness 1.5.

Since $n_S(k) = O(1)$, the running time of the reduce function on
the first three datasets becomes $\Theta(n_R(k))$, i.e., linear. It remains
superlinear in the last two datasets.

• MatMult is taken from a MapReduce library for sparse matrix
multiplication and implements a blocked matrix multiplication
algorithm. The map function duplicates each block as many times
as the number of products for which the block is required. The
reduce function multiplies two input blocks with running time
$\Theta(n^3 \cdot d \cdot d')$, where $n$ is the block side, and $d$ and $d'$ are the block
densities. We considered three variants of MatMult, using different
strategies for partitioning keys:

• In MatMult-Opt the number of block products assigned to each
reduce task is optimally balanced (this is the
default library partitioner).

• In MatMult-Rnd block products are assigned randomly among 
reduce tasks. This may result in some reduce data unbalancing. 

• In MatMult-Unbal we forced a very unbalanced partition. Let
$k$ be the number of matrix blocks. Among the $k^{3/2}$ block
products performed by the algorithm, the most computationally
demanding $k$ products are assigned to a single reduce task,
while the remaining ones are fairly distributed.

| BENCHMARK | GOAL | REDUCE COMPLEXITY | # DATASETS |
|---|---|---|---|
| WordCount | word frequency counter | $\Theta(n)$ | 1 |
| WordCount-NC | word frequency counter (no combine) | $\Theta(n)$ | 1 |
| InvertedIndex | computing inverted index of a set of documents | $\Theta(n^2)$ | 2 |
| 2PathGenerator | two-length path generation in a graph | $\Theta(n^2)$ | 6 |
| TriangleCount | triangle counting in a graph | $\Theta(n)$ | 6 |
| NaturalJoin | natural join $R \bowtie S$ between relations $R$ and $S$ | $\Theta(n_R(k) \cdot n_S(k))$ | 5 |
| MatMult-Opt | sparse matrix multiplication using an adaptive partitioner | $\Theta(n^3 \cdot d \cdot d')$ | 2 |
| MatMult-Rnd | sparse matrix multiplication using a random partitioner | $\Theta(n^3 \cdot d \cdot d')$ | 2 |
| MatMult-Unbal | sparse matrix multiplication using an unbalanced partitioner | $\Theta(n^3 \cdot d \cdot d')$ | 2 |

Table 2. A summary of the real-world benchmarks considered in our experimental evaluation.

We have tested each MatMult variant on two input datasets. In the
former, matrix items are uniformly distributed and all matrix blocks
have the same expected density $d = d' = 0.25$. In the latter, block
densities are different depending on the block position: blocks {i, j) 
are such that d = 1/2'^-^-^ and d' = where k is 

the number of blocks in each matrix. Intuitively, density increases 
across columns or rows in the two input matrices. 

7.4 Metrics

In our experimental analysis we assessed prediction accuracy,
slowdown, and space overhead of NearestFit.

Progress accuracy. This is obtained by computing the absolute
percentage error between the estimated and the optimal progress:

$$\mathit{error}(t) = \left| \frac{t - t_{start}}{e(t) - t_{start}} - \frac{t - t_{start}}{e^{*} - t_{start}} \right| \times 100$$

where $e(t)$ is the ending time estimated at time $t$ and $e^{*}$ is the
actual ending time.

Let $T$ be the set of prediction times, i.e., times at which progress
is updated. Then the mean and maximum errors are computed as
follows:

$$\mathit{avgErr} = \frac{1}{|T|} \sum_{t \in T} \mathit{error}(t) \qquad \mathit{maxErr} = \max_{t \in T} \mathit{error}(t)$$

Slowdown. We compare the native running times with executions
under NearestFit.

Space overhead. Let $|\mathit{map\ profile}|$ and $|\mathit{reduce\ profile}|$ be the
cumulative sizes for profiles collected among all map tasks and all
reduce tasks, respectively. To evaluate space overhead, we
compare the amount $|\mathit{map\ profile}| + |\mathit{reduce\ profile}|$ with the shuffle
data size:

$$\mathit{overhead} = \frac{|\mathit{map\ profile}| + |\mathit{reduce\ profile}|}{|\mathit{shuffle\ data}|}$$

7.5 Platform and Hadoop configuration

The experiments have been carried out on three different Amazon
EC2 clusters, running a customized release of Hadoop 2.6.0.
Besides a node for resource managers, the three clusters included 8,
16, and 32 workers devoted to both Hadoop tasks and the HDFS.
We used Amazon EC2 m1.xlarge instances, each providing 4
virtual cores, 15 GiB of main memory, and 840 GiB of secondary
storage. Since one worker runs the MapReduce application master, the
actual parallelism available for map and reduce tasks is decreased
by one on each cluster. We have considered only clusters composed
of homogeneous machines and disabled speculative execution.

Among several Hadoop runtime parameters, the number $R$ of
reduce tasks plays a crucial role during the execution of a job.
If $R$ is set to the (actual) task parallelism available on a cluster,
we have a single wave execution where all reduce tasks can be
immediately started after the map phase. If $R$ is larger than the
available cluster parallelism, then the amount of shuffle data can
be partitioned among a higher number of tasks, decreasing the task
input size and possibly reducing the impact of straggling instances.
Multiple wave executions, however, can increase the framework
overhead. We have thus considered two execution scenarios:

• single-wave, setting $R$ to 7, 15, and 31 on the three clusters;

• multiple-waves, setting $R$ to two times the cluster parallelism
(i.e., 14, 30, and 62, respectively).

This choice has been driven by the optimization suggestions
provided by the official Hadoop documentation.

8. Experimental evaluation

In this section we discuss the outcome of an extensive empirical
evaluation, which required roughly 500 cluster hours over the EC2
platform. This is a very optimistic estimate: it does not include
times for cluster configuration, setup times for each single
experiment, collection and processing of experimental data, debugging
issues, and tuning of progress indicators (e.g., NearestFit
experiments with different threshold choices). The description focuses
on the reduce phase. As observed in Section 3.6, we used a linear
model in the map phase and results across all benchmarks
consistently proved to be very accurate.


8.1 A warm-up example 

As a preliminary experiment, we used our synthetic benchmark to
study the interplay between data skewness and time complexity of
the reduce functions. We fed different versions of the synthetic
benchmark (linear, quadratic, cubic, and quartic) with data sets
characterized by different skewness levels. The outcome of the
experiment is summarized in Figure 6 and confirms our hypothesis
that linear progress models are harmed by data skewness and large
running times.

In Figure 6a, the dataset is mildly skewed and only the time
complexity changes. In the $\Theta(n)$ case, all progress indicators
roughly match the optimal progress. As the running time grows,
differently from NearestFit, Ratio becomes more and more
inaccurate (the job and task versions, as well as Hadoop, overlap and
we plot only one of them).

In Figure 6b we performed the symmetric experiment, fixing
the time complexity to O(n^) and increasing the data skewness
α from 1.05 (almost unskewed) to 1.8 (rather skewed). Average
and maximum error of Ratio increase with α, while the maximum
error of NearestFit is worse for unskewed data. Notice the log
scale on the y-axis. Overall, NearestFit appears to be much more
accurate than Ratio and its error is not badly affected by skewness:
the average error of NearestFit stays below 1%, while it can be as
much as 33% for Ratio.
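The effect described above can be reproduced with a toy model. The sketch below is not the synthetic benchmark used in the experiments: it assumes Zipf-like key group sizes, a reduce cost of n^c for a key group of n values, and a single reduce task that processes key groups heaviest first. All names and parameters are hypothetical. It compares a linear, input-based estimate (in the spirit of Ratio) with the true time-based progress.

```python
def key_group_sizes(num_keys, alpha, largest=10_000):
    """Zipf-like sizes: the i-th heaviest key group has about largest / i**alpha values."""
    return [max(1, round(largest / (i ** alpha))) for i in range(1, num_keys + 1)]


def max_linear_error(sizes, exponent):
    """Largest gap, in percentage points, between a linear input-based
    progress estimate and the true time-based progress of one reduce task."""
    costs = [n ** exponent for n in sizes]      # modeled running time per key group
    total_input, total_time = sum(sizes), sum(costs)
    seen_input = elapsed = 0
    worst = 0.0
    for n, c in zip(sizes, costs):
        seen_input += n
        elapsed += c
        estimate = seen_input / total_input     # fraction of input consumed so far
        actual = elapsed / total_time           # fraction of running time really spent
        worst = max(worst, abs(estimate - actual) * 100)
    return worst


sizes = key_group_sizes(num_keys=1000, alpha=1.4)
for exponent in (1, 2, 3):
    print(exponent, round(max_linear_error(sizes, exponent), 2))
```

With a linear cost the two notions of progress coincide, so the error is zero; under this model the gap does not shrink as the exponent grows, because the heaviest key groups account for an increasing share of the running time.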

In the previous experiments we used a single reduce task: Figure 6c
shows the progress plot for α = 1.4 and quadratic complexity.
The error increases with the number of tasks, especially

Figure 6. Experiments on the synthetic benchmark: (a) progress plots for different reduce running times; (b) error of progress estimates for
O(n^) reduce functions on datasets with increasing skewness; (c-d) progress plots obtained with 1 and 8 reduce tasks, respectively.



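| Benchmark | Dataset | Mean APE: Hadoop | Mean APE: JobRatio | Mean APE: TaskRatio | Mean APE: NearestFit | Max APE: Hadoop | Max APE: JobRatio | Max APE: TaskRatio | Max APE: NearestFit |
|---|---|---|---|---|---|---|---|---|---|
| WordCount | wiki | 11.30 | 13.02 | 13.01 | 12.44 | 19.02 | 21.34 | 21.34 | 20.80 |
| WordCount-NC | wiki | 6.68 | 0.48 | 0.53 | 0.95 | 16.27 | 1.56 | 1.56 | 3.10 |
| InvertedIndex | wiki 5K | 6.85 | 4.11 | 5.88 | 8.03 | 13.46 | 12.86 | 17.65 | 18.59 |
| InvertedIndex | wiki 50K | 7.94 | 12.34 | 20.16 | 1.76 | 17.66 | 21.05 | 37.31 | 5.51 |
| 2PathGenerator | loc-Gowalla | 10.91 | 39.47 | 36.30 | 4.90 | 34.23 | 87.51 | 87.51 | 12.34 |
| 2PathGenerator | web-Google | 3.04 | 9.40 | 7.39 | 0.76 | 5.53 | 16.02 | 14.94 | 1.73 |
| 2PathGenerator | web-Stanford | 14.40 | 25.35 | 13.86 | 1.19 | 36.53 | 48.29 | 24.60 | 2.95 |
| 2PathGenerator | com-Youtube | 23.12 | 31.78 | 18.24 | 3.09 | 48.20 | 58.29 | 46.16 | 7.15 |
| 2PathGenerator | web-Berkstan | 27.51 | 31.84 | 22.29 | 2.40 | 51.81 | 92.29 | 91.85 | 5.75 |
| 2PathGenerator | as-Skitter | 26.21 | 26.60 | 21.46 | 1.07 | 46.30 | 66.05 | 60.51 | 2.33 |
| TriangleCount | loc-Gowalla | 7.18 | 0.47 | 0.66 | 9.24 | 11.74 | 1.54 | 2.07 | 35.23 |
| TriangleCount | web-Google | 1.81 | 1.10 | 0.53 | 0.22 | 4.50 | 1.78 | 0.79 | 0.45 |
| TriangleCount | web-Stanford | 1.92 | 0.57 | 0.17 | 0.10 | 3.98 | 0.84 | 0.28 | 0.29 |
| TriangleCount | com-Youtube | 1.55 | 0.96 | 1.00 | 0.90 | 4.34 | 1.92 | 2.15 | 2.01 |
| TriangleCount | web-Berkstan | 3.30 | 0.85 | 1.58 | 1.44 | 8.77 | 2.25 | 4.49 | 4.37 |
| TriangleCount | as-Skitter | 4.43 | 2.78 | 2.68 | 4.65 | 7.02 | 6.19 | 5.94 | 7.97 |
| NaturalJoin | linear 1.0 | 6.53 | 3.91 | 1.46 | 1.40 | 16.41 | 7.02 | 2.81 | 9.58 |
| NaturalJoin | linear 1.5 | 25.85 | 1.12 | 1.14 | 2.34 | 48.55 | 1.81 | 1.81 | 5.02 |
| NaturalJoin | linear 2.0 | 30.24 | 3.55 | 3.56 | 0.76 | 58.91 | 7.37 | 7.37 | 2.72 |
| NaturalJoin | si 2.0 1.0 | 31.81 | 45.49 | 45.62 | 11.11 | 70.17 | 95.96 | 95.96 | 22.77 |
| NaturalJoin | si 1.5 | 34.10 | 48.14 | 48.24 | 1.68 | 72.84 | 97.37 | 97.37 | 3.73 |
| MatMult-Opt | skewed | 5.18 | 11.16 | 10.81 | 1.43 | 9.88 | 21.59 | 20.66 | 7.41 |
| MatMult-Opt | uniform | 7.01 | 1.50 | 1.36 | 0.47 | 12.46 | 3.10 | 2.84 | 0.87 |
| MatMult-Rnd | skewed | 12.51 | 11.94 | 11.23 | 0.71 | 22.56 | 22.26 | 21.16 | 3.97 |
| MatMult-Rnd | uniform | 5.75 | 0.81 | 0.83 | 0.30 | 12.38 | 1.64 | 1.70 | 0.49 |
| MatMult-Unbal | skewed | 38.20 | 9.09 | 4.65 | 0.14 | 77.06 | 15.04 | 12.05 | 2.08 |
| MatMult-Unbal | uniform | 14.92 | 0.61 | 0.72 | 0.35 | 37.91 | 1.31 | 1.62 | 1.07 |
| Arithm. mean | | 13.71 | 12.53 | 10.94 | 2.73 | 28.46 | 26.45 | 25.35 | 7.05 |

Table 3. Accuracy of progress indicators with 8 workers and single wave execution (APE: absolute percentage error).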


for Ratio and Hadoop, as shown by the progress plot in Figure 6d
(same parameters, 8 reduce tasks).

8.2 Progress indicator accuracy 

In this section we compare the accuracy of NearestFit with the
state-of-the-art progress indicators on the real benchmarks. The
main outcome of our analysis is summarized in Table 3, which
reports the mean and maximum absolute percentage errors across
all benchmarks and data sets, computed on the smallest cluster.
The arithmetic mean of the errors is shown on the last line and
gives a clear indication of the accuracy of the different
indicators. The picture is more varied when we examine each
specific benchmark.
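The last line of Table 3 can be re-derived from the per-benchmark rows. The short check below recomputes the arithmetic mean of two of the mean-error columns (Hadoop and NearestFit); the variable and function names are ours, and the values are copied from Table 3 in row order.

```python
# Mean absolute percentage errors, one entry per (benchmark, dataset) row of Table 3.
hadoop_mean_errors = [
    11.30, 6.68, 6.85, 7.94,
    10.91, 3.04, 14.40, 23.12, 27.51, 26.21,
    7.18, 1.81, 1.92, 1.55, 3.30, 4.43,
    6.53, 25.85, 30.24, 31.81, 34.10,
    5.18, 7.01, 12.51, 5.75, 38.20, 14.92,
]
nearestfit_mean_errors = [
    12.44, 0.95, 8.03, 1.76,
    4.90, 0.76, 1.19, 3.09, 2.40, 1.07,
    9.24, 0.22, 0.10, 0.90, 1.44, 4.65,
    1.40, 2.34, 0.76, 11.11, 1.68,
    1.43, 0.47, 0.71, 0.30, 0.14, 0.35,
]


def arithmetic_mean(values):
    """Plain arithmetic mean of a non-empty list of numbers."""
    return sum(values) / len(values)


print(round(arithmetic_mean(hadoop_mean_errors), 2))      # 13.71
print(round(arithmetic_mean(nearestfit_mean_errors), 2))  # 2.73
```

Both results match the "Arithm. mean" row, which averages over the 27 benchmark and dataset combinations.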

NearestFit is especially good at predicting progress for high
time complexity and high skewness. The emblematic example is
2PathGenerator, where both Hadoop and Ratio exhibit very
poor accuracy. For instance, the maximum error of JobRatio can
be as large as 92%. Moreover, this is not a single wrong prediction
in each execution: predictions are repeatedly incorrect, as shown
by the high values of the average error. TaskRatio is slightly
better than JobRatio, though the two approaches are overall quite
similar.


On benchmarks for which reduce functions are less computationally
demanding (linear time), both NearestFit and Ratio are
accurate: the latter is sometimes better (see, e.g., TriangleCount
on datasets as-Skitter or loc-Gowalla), but the average error
of NearestFit is always very reasonable. The maximum error has
a peak on loc-Gowalla, whose execution time is however quite
short: the reduce phase takes less than 4 minutes. The different
behavior of Ratio on the quadratic benchmark 2PathGenerator
and the linear benchmark TriangleCount is exemplified by
Figure 7, which shows the progress plots obtained for dataset
as-Skitter. It is worth noticing that not only is the error of Ratio
significant on benchmark 2PathGenerator, but there are also
wide prediction fluctuations, ranging from underestimates to large
overestimates.

The Hadoop progress indicator is instead mainly affected by
load balancing issues between tasks: load unbalancing can yield
very wrong estimates, as exemplified for benchmarks MatMult-Opt
and MatMult-Unbal in Figure 8. While NearestFit is close to
optimal and Ratio is reasonably accurate in both experiments,
Hadoop predictions are very poor with the unbalanced partitioner.
This is because the progress of short tasks quickly reaches 100%
(short tasks complete early, as shown in the swimlanes plot),
yielding a large average progress for the entire phase.
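The averaging effect described in the paragraph above can be seen in a small model. This sketch assumes, as a simplification, that the phase progress is the plain average of per-task progress and that each task advances linearly in time; Hadoop's real computation is more involved. The task durations are hypothetical.

```python
def phase_progress_by_task_average(durations, t):
    """Average of per-task progress at time t, each task progressing linearly."""
    return sum(min(t / d, 1.0) for d in durations) / len(durations)


def phase_progress_by_time(durations, t):
    """Fraction of the phase's wall-clock time elapsed at time t,
    with all tasks running in parallel (single wave)."""
    return min(t / max(durations), 1.0)


durations = [1, 1, 1, 1, 1, 1, 1, 10]  # seven short tasks and one straggler
t = 1
print(phase_progress_by_task_average(durations, t))  # 0.8875
print(phase_progress_by_time(durations, t))          # 0.1
```

After one time unit the seven short tasks are complete, so the averaged indicator reports almost 89% while only 10% of the phase's running time has elapsed.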
Figure 7. Progress prediction on a quadratic benchmark
(2PathGenerator, left column) and a linear benchmark
(TriangleCount, right column) on the smallest cluster and
dataset as-Skitter.

Figure 8. Load balancing issues: MatrixMult-Opt and
MatrixMult-Unbal for skewed matrices on the smallest
cluster (progress plots and swimlanes plots).

Figure 10. Slowdown with respect to native execution on a selection
of representative benchmarks on three different clusters.


reports for each benchmark the slowdown on three different clusters.
On linear benchmarks and unskewed data, using larger clusters
yields shorter running times. This is not the case on super-linear
benchmarks and skewed data, where a larger degree of parallelism
cannot be fully exploited due to stragglers: the running
time in 2PathGenerator and NaturalJoin, for instance, stays
almost constant. The slowdown is always small: no execution took
more than 1.06× the native running time. The average slowdown
across all benchmarks, including those not reported in Figure 10,
was 0.4%, as shown in Figure 11a (8 workers, single wave).
Figure 11a also shows that the average slowdown slightly increases on
larger clusters: this is expected, as native executions can
be shorter while the collected profile data stays the same.
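The last observation can be illustrated with a back-of-the-envelope model. The sketch below assumes, purely for illustration, that profiling adds a fixed cost to the wall-clock time regardless of cluster size; the function name and all numbers are hypothetical and are not measurements from the experiments.

```python
def slowdown(native_seconds, overhead_seconds):
    """Ratio between profiled and native wall-clock time."""
    return (native_seconds + overhead_seconds) / native_seconds


fixed_overhead = 6.0                      # hypothetical profiling cost, in seconds
for native in (1200.0, 600.0, 300.0):     # hypothetical native times on growing clusters
    print(native, round(slowdown(native, fixed_overhead), 3))
```

Under this assumption the same absolute cost weighs more on a shorter native run, so the relative slowdown grows as the cluster gets larger.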

8.4 Space overhead 

We first focus on the profile size, which determines communication
costs. As shown in Figure 11b, the space overhead is negligible and
independent of the cluster size. The space usage is small because
on the map side information on implicit keys is aggregated and on
the reduce side we use smoothing (see Section [ST] and Section]^ 
respectively). Cluster independence follows from the observation
that the number of map tasks depends only on the job input size.
The number of reduce tasks, on the other hand, changes on different
clusters, but the number of spawned reduce functions is exactly the
same: small, unpredictable variations in the size of reduce task
profiles are due to different scheduling strategies, which might yield
different bursts (see Section|^. Benchmark-specific figures on the 
small cluster are given in Table 4.

We also analyzed the memory peak on worker nodes and on the
application master. We omit detailed results since memory turned out
not to be a bottleneck. Memory usage in our implementation is indeed
limited by fixed constants: A on the worker side, and space-efficient 
streaming data structures on the master side (see Section[5T). 


Figure 9. Accuracy of the progress indicators on different clusters 
for a single wave (left) or multiple waves (right) of reduce tasks. 

The error of Hadoop increases on larger clusters, where the
impact of the many short tasks on progress analysis becomes more
and more noticeable. This is confirmed by Figure 9, which plots the
mean of the average errors across all benchmarks and datasets on
three different clusters. Ratio and NearestFit are only slightly
affected by larger degrees of parallelism, and the results of Table 3
are confirmed on all clusters, both for single-wave and for
multiple-wave executions.

8.3 Slowdown

We now discuss the overhead introduced by NearestFit on the
wall-clock time of native executions. Performance figures on a
selection of representative benchmarks are shown in Figure 10, which


8.5 Time/space/accuracy tradeoffs 

In Section]^ we introduced a distinction between implicit and ex¬ 
plicit keys and we observed that this is crucial to make NearestFit 
practical. We now back up this observation with experimental data. 

Figure 12 considers the 2PathGenerator and TriangleCount
benchmarks on the com-Youtube dataset. We ran NearestFit by
exponentially increasing the number A of explicit keys (in all the
experiments discussed so far A = 2000). When increasing A, we
also change the size of the streaming data structure used by the
application master to merge explicit keys: this is set to 35A, which
is much smaller than the total number of explicit keys collected
across all the 410 map tasks of com-Youtube.
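The streaming data structure itself is described elsewhere in the paper and is not reproduced here. Purely as an illustration of how a bounded-size summary can merge explicit keys arriving from many map tasks, the sketch below uses a weighted variant of the Space-Saving heavy-hitter algorithm. This is our substitution and not necessarily the structure used by NearestFit; `capacity` plays the role of the 35A bound, and the sample stream is hypothetical.

```python
def space_saving(stream, capacity):
    """Weighted Space-Saving summary: keeps at most `capacity` counters.

    `stream` yields (key, weight) pairs, e.g. the size of a key group
    reported by one map task. Counters may overestimate a key's weight,
    but heavy keys are retained."""
    counters = {}
    for key, weight in stream:
        if key in counters:
            counters[key] += weight
        elif len(counters) < capacity:
            counters[key] = weight
        else:
            # Evict the lightest counter and let the new key inherit its count.
            victim = min(counters, key=counters.get)
            inherited = counters.pop(victim)
            counters[key] = inherited + weight
    return counters


# Hypothetical reports: key "a" is heavy, the others are light.
reports = [("a", 50), ("b", 1), ("c", 1), ("d", 1), ("a", 50), ("e", 1)]
summary = space_saving(reports, capacity=3)
print(summary["a"])           # 100
print(sum(summary.values()))  # 104, the total weight seen
```

The linear scan for the lightest counter keeps the sketch short; a production version would typically use a heap or a bucket list instead.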

Figure 12a shows that accuracy does not benefit from larger values
of A. On the other hand, as shown in Figure 12b, the largest A
values yield a steep increase of the running times, due to garbage
collection and communication costs. Space usage is also harmed
by A, since map task profiles become larger as A increases. On the

| Benchmark | Dataset | INPUT (MB) | SHUFFLE DATA (MB) | OUTPUT (MB) | MAP PROFILE (MB) | REDUCE PROFILE (MB) | OVERHEAD (%) |
|---|---|---|---|---|---|---|---|
| WordCount | wiki | 50202.70 | 9812.04 | 5600.17 | 20.01 | 2.58 | 0.23 |
| WordCount-NC | wiki | 50202.70 | 21715.84 | 891.43 | 21.12 | 1.67 | 0.10 |
| InvertedIndex | wiki 5K | 51112.74 | 12962.94 | 12784.75 | 265.93 | 5.25 | 2.09 |
| InvertedIndex | wiki 50K | 51112.74 | 18277.93 | 18019.26 | 1708.35 | 7.62 | 9.39 |
| 2PathGenerator | loc-Gowalla | 32.03 | 35.66 | 4395.35 | 0.38 | 0.98 | 3.82 |
| 2PathGenerator | web-Google | 161.61 | 178.09 | 14322.27 | 0.38 | 4.31 | 2.64 |
| 2PathGenerator | web-Stanford | 75.46 | 83.06 | 74930.03 | 0.38 | 2.07 | 2.96 |
| 2PathGenerator | com-Youtube | 108.35 | 119.75 | 25686.61 | 0.39 | 2.33 | 2.27 |
| 2PathGenerator | web-Berkstan | 259.96 | 285.33 | 501082.86 | 0.78 | 4.79 | 1.95 |
| 2PathGenerator | as-Skitter | 425.05 | 467.38 | 287677.55 | 0.39 | 8.95 | 2.00 |
| TriangleCount | loc-Gowalla | 4431.45 | 4977.59 | 9.74 | 4.05 | 4.94 | 0.18 |
| TriangleCount | web-Google | 14497.59 | 15856.18 | 59.39 | 12.04 | 12.66 | 0.16 |
| TriangleCount | web-Stanford | 75078.55 | 82518.25 | 28.30 | 59.99 | 32.88 | 0.11 |
| TriangleCount | com-Youtube | 25819.82 | 28595.63 | 19.47 | 21.10 | 17.40 | 0.13 |
| TriangleCount | web-Berkstan | 501832.44 | 554678.13 | 98.44 | 399.34 | 50.16 | 0.08 |
| TriangleCount | as-Skitter | 288383.69 | 318605.68 | 130.04 | 229.31 | 159.68 | 0.12 |
| NaturalJoin | linear 1.0 | 4283.23 | 5822.30 | 83924.16 | 2.00 | 1.69 | 6.33E-02 |
| NaturalJoin | linear 1.5 | 4258.59 | 5797.72 | 83923.36 | 1.89 | 0.73 | 4.52E-02 |
| NaturalJoin | linear 2.0 | 4258.59 | 5797.72 | 83923.36 | 1.89 | 0.73 | 4.52E-02 |
| NaturalJoin | si 2.0 1.0 | 8.44 | 11.49 | 50758.64 | 0.07 | 0.08 | 1.34 |
| NaturalJoin | si 1.5 | 8.43 | 11.48 | 148455.27 | 0.11 | 0.02 | 1.14 |
| MatMult-Opt | skewed | 1021.96 | 8907.43 | 14010.68 | 0.02 | 0.01 | 4.18E-04 |
| MatMult-Opt | uniform | 933.84 | 7856.49 | 15703.94 | 0.02 | 0.01 | 4.53E-04 |
| MatMult-Rnd | skewed | 1021.96 | 8907.43 | 14010.68 | 0.02 | 0.01 | 4.18E-04 |
| MatMult-Rnd | uniform | 933.84 | 7856.49 | 15703.94 | 0.02 | 0.01 | 4.50E-04 |
| MatMult-Unbal | skewed | 1021.96 | 8907.43 | 14010.68 | 0.02 | 0.01 | 4.21E-04 |
| MatMult-Unbal | uniform | 933.84 | 7856.49 | 15703.94 | 0.02 | 0.01 | 4.51E-04 |

Table 4. Exact figures for the space overhead of NearestFit (8 workers, single wave).




Figure 11. Average slowdown (a) and space overhead (b) for executions
with single and multiple waves of reduce tasks.


2PathGenerator, the overhead flattens at A = 2® x 1000 because 
the number of distinct keys is smaller than this value. 

Figure 12d plots the percentage of explicit keys after merging,
with respect to the total number of distinct keys. When A >
2® x 1000, all the keys in 2PathGenerator are explicit, as already
observed. There is a remarkable difference between the two
benchmarks: in TriangleCount, the percentage of explicit keys
is much smaller than in 2PathGenerator (notice the log scale on
the y-axis). This is because map functions in TriangleCount produce
a large number of very small key groups. The flat trend for
A > 2® x 1000 therefore depends on the streaming data structure.
Figure 12e shows how much shuffle data is associated with
explicit keys: a comparison of Figures 12d and 12e reveals that,
even for A = 2000, 1% of keys is associated with 38% of shuffle
data in 2PathGenerator, due to data skewness. This suggests that
a small space is sufficient to characterize the heaviest keys without
compromising accuracy.
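The relation between the fraction of explicit keys and the fraction of shuffle data they cover can be explored on synthetic data. The sketch below is only an illustration: the Zipf-like workload and the function name are hypothetical and do not reproduce the com-Youtube measurements.

```python
import heapq


def explicit_key_coverage(group_sizes, num_explicit):
    """Return (% of keys kept explicit, % of shuffle data they account for),
    assuming the explicit keys are the heaviest key groups."""
    explicit = heapq.nlargest(num_explicit, group_sizes)
    pct_keys = 100.0 * len(explicit) / len(group_sizes)
    pct_data = 100.0 * sum(explicit) / sum(group_sizes)
    return pct_keys, pct_data


# Hypothetical skewed workload: the i-th heaviest key group holds
# about 1e6 / i**1.4 values, with a minimum of one value per key.
sizes = [max(1, round(1_000_000 / i ** 1.4)) for i in range(1, 100_001)]
pct_keys, pct_data = explicit_key_coverage(sizes, 1000)
print(f"{pct_keys:.1f}% of keys carry {pct_data:.1f}% of the shuffle data")
```

On skewed inputs of this kind a small percentage of keys covers a much larger percentage of the data, which is the property that makes a bounded number of explicit keys informative.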

8.6 Nearest neighbor regression vs. curve fitting 

As a final experiment, we analyzed the interplay between near¬ 
est neighbor regression and curve fitting. Using Equation on 
the longest task, we computed the percentage of ri{t) obtained 
using curve fitting, i.e., using the / defined in Equation ^ In 
2PathGenerator, more than 75% of the predicted time is always



Figure 13. Curve fitting vs. nearest neighbor regression:
2PathGenerator (top) and TriangleCount (bottom) on
web-Stanford.


due to curve fitting (Figure 13, top), while the percentage is
negligible (even zero) for TriangleCount (Figure 13, bottom), where
all keys are very small and similar.

We then evaluated the accuracy of a variant of NearestFit that
is restricted to use only curve fitting (called Fit in Figure 13). As
expected, whenever there is no significant skewness and reduce
executions are all very short, the accuracy of Fit can be very
poor, as in Figure 13. Hence, the orderly combination of the two
techniques appears to be crucial to obtain good progress estimates.
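To make the interplay concrete, here is a minimal sketch of one way to combine the two techniques. It is not the combination defined by the equations of the paper: the relative `tolerance` threshold, the power-law model, and all names are assumptions made for illustration. A profiled key group of similar size answers directly (nearest neighbor); otherwise a fitted curve extrapolates.

```python
import math


def fit_power_law(points):
    """Least-squares fit of time = a * size**b in log-log space.
    Requires at least two distinct positive sizes and positive times."""
    xs = [math.log(n) for n, _ in points]
    ys = [math.log(t) for _, t in points]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    a = math.exp(my - b * mx)
    return a, b


def predict_time(points, size, tolerance=0.1):
    """Use the nearest profiled key group if its size is close enough,
    otherwise extrapolate with the fitted curve."""
    nearest_size, nearest_time = min(points, key=lambda p: abs(p[0] - size))
    if abs(nearest_size - size) <= tolerance * size:
        return nearest_time
    a, b = fit_power_law(points)
    return a * size ** b


# Hypothetical profile of (key group size, seconds) pairs with quadratic cost.
profile = [(n, 0.001 * n ** 2) for n in (10, 20, 50, 100, 200)]
print(predict_time(profile, 105))   # close to a profiled size: nearest neighbor answers
print(predict_time(profile, 5000))  # far from every profiled size: curve fitting answers
```

When all key groups are small and similar, the nearest neighbor branch is taken almost always, while heavily skewed inputs push predictions for the largest key groups onto the fitted curve.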

9. Related work 

Designing profiling methodologies able to provide insights into
the performance of data-parallel jobs and their scaling properties
is extremely difficult. In this section we focus on the class of
MapReduce computations, which are the target of this paper.

Figure 12. Time/space/accuracy tradeoffs with different numbers of explicit keys: (a) accuracy; (b) slowdown; (c) space overhead; (d)
percentage of explicit keys; (e) percentage of the shuffle data associated with explicit keys.


HiTune fT) is a lightweight performance analyzer for Hadoop 
that allows users to identify application hot-spots and hardware 
problems. PerfXplain ca addresses performance debugging by 
comparing the profiles of different MapReduce jobs: programmers 
specify performance queries and the tool generates explanations 
obtained from the analysis of a log of past job executions. Simi¬ 
larly, self-tuning systems such as Starfish (SI can help users un¬ 
derstand and optimize many configuration parameters of Hadoop. 
With respect to our work, these papers target different goals. 

A different line of research has addressed the problem of perfor¬ 
mance prediction in the context of MapReduce-style applications. 
Accurate predictions can be exploited for different practical pur¬ 
poses, ranging from more precise progress indicators to different 
optimization strategies able to reduce the job completion time. Par¬ 
allax (SI is a progress indicator for MapReduce pipelines based on 
a linear prediction model similar to JobRatio. It can use job
profiles collected on past executions. Paratimer (24) extends Parallax
towards workflows characterized by complex DAGs: it can predict 
the progress of parallel queries by analyzing the critical path in the 
computation. ARIA 0^ is a framework for automatically allocat¬ 
ing the proper amount of cluster resources in order to complete a 
job within a certain (soft) time deadline. Exploiting job profiles 
collected on past executions, the linear performance model pro¬ 
posed by ARIA determines the task parallelism sufficient to meet 
the deadline. To the best of our knowledge (and as shown in this
paper), none of these works is able to accurately predict the running
time of MapReduce applications in the presence of both data skewness
and super-linear reduce functions.

Uneven data distribution is one of the main reasons for straggler
tasks in MapReduce jobs. Several works (niniiiiiiiii have thus 
approached this challenge: their main idea is to detect stragglers 
and split them as soon as possible in order to reduce the job 
completion time. Akin to this paper, Dl considers the problem 
of super-linear reduce implementations alongside data skewness, 
but requires a user-defined cost model for predicting the reduce 
running time. Moreover, the size |V_k| of the key group processed
by a reduce function is not collected explicitly, but is estimated as 
an average among the sizes of all the key groups assigned to the 
reduce task. Though efficient, this could be rather inaccurate. Load 
balancing for reduce tasks is also addressed in du, where map 
outputs are sampled and the most frequent keys are detected: this is 
similar to our notion of explicit keys, but we avoid sampling thanks 
to the use of streaming data structures. 

10. Concluding remarks 

In this paper we have introduced the NearestFit progress indicator.
NearestFit targets accuracy of progress predictions, even in
the presence of data skewness and super-linear reduce implementations,
thanks to a careful combination of two learning techniques:
nearest neighbor regression and statistical curve fitting. It also
targets practical feasibility by exploiting different space-efficient
data structures and data streaming algorithms. An extensive empirical
assessment over the Amazon EC2 platform and several benchmarks
has confirmed the precision and effectiveness of our model, which can
be accurate even where competitors can be seriously harmed (i.e.,
in the presence of high computation times and skewed data). We
believe that the better accuracy of NearestFit can also be beneficial
in other settings, where performance prediction is used as a
building block for pursuing sophisticated profile-guided
optimizations.

There are some assumptions that can be regarded as threats to
the validity of our approach:

• Performance as a function of the input size. Our model makes
the assumption that $f(k_1, V_{k_1}) \approx f(k_2, V_{k_2})$ whenever
$|V_{k_1}| \approx |V_{k_2}|$. In some applications, the performance
could instead depend on the input values, and not on the input size:
different values might yield quite different processing times even
for key groups with similar size. We are not aware of MapReduce
applications where this happens, and in general our approach is
akin to algorithmic asymptotic analysis and to previous input- 
sensitive profiling works (mu. 

• Uninformative profiling data. Applications characterized by a
very small number of distinct reduce keys can undermine our
approach, since the profiling data collected across all tasks
may be uninformative, preventing NearestFit from benefiting
from both nearest neighbor regression and curve fitting. These
applications, however, would not exploit parallelism at all in
the reduce phase, and are unlikely to occur in realistic big data
computing scenarios.

Two aspects that we have not addressed explicitly in this paper are 
task failures and heterogeneous clusters. Different progress pre¬ 
dictions (e.g., best case vs. worst case) could be returned to as¬ 
sess the impact of different failure scenarios on the job progress. 
With respect to heterogeneous clusters, a guiding design principle 
of NearestFit is to exploit as much as possible task-specific in¬ 
formation (points 1 and 2 in the algorithm combination described in 
Section|3^. Global information (points 3 and 4) is used as failsafe 
strategy and should take into account different node characteristics
in heterogeneous clusters. As future work, we plan to extend our
implementation in these directions.

A very interesting research problem is how to apply non-linear
performance models to predict the overall progress of pipelines of
jobs and workflows characterized by complex DAGs (such as those
produced by Pig (27)). This appears to be challenging if profiles
have to be dynamically collected during the actual workflow
execution, without resorting to historical information.

Additional examples of progress and swimlanes plots

Figure 14. Progress and swimlanes plots: InvertedIndex on
wiki 50K, 8 workers, single wave.

Figure 15. Progress and swimlanes plots: 2PathGenerator (left) vs.
TriangleCount (right) on as-Skitter, 8 workers, double wave.

Figure 16. Progress and swimlanes plots of NaturalJoin, 8 workers,
on dataset linear 1.5 with single wave (left) and dataset si 1.5 with
double wave (right).